* [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20
@ 2010-03-19 0:59 Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
` (17 more replies)
0 siblings, 18 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan
Hi,
Following Andreas Dilger's reply (http://lkml.org/lkml/2010/3/17/410)
I'm (re)posting the subset of checkpoint-restart patch-set that is
related to linux-fsdevel. (I'm unsure why those weren't sent before).
Altogether there are 17 patches here (out of the 96 total).
For the original post/thread see: http://lkml.org/lkml/2010/3/17/232.
As Matt Helsley put it briefly, checkpoint-restart mainly saves the
critical pieces of kernel information from the struct file needed to
restart the open file descriptors. It does not save the file (system)
contents in the checkpoint image. That's left for proper filesystem
freezing, snapshotting, or rsync (for example) depending on the tools
and/or filesystems userspace has chosen.
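To make the point above concrete, here is a minimal userspace sketch of
"saving the critical pieces" of an open file: only the path, the open
flags and the file position are recorded, never the file contents. The
names struct fd_record, fd_record_save() and fd_record_restore() are
hypothetical and exist only for this illustration; this is not the
checkpoint image format or the kernel code of the patch set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record, NOT the real checkpoint image layout. */
struct fd_record {
	char  path[256];	/* where the file lives (contents are not saved) */
	int   flags;		/* access mode and status flags from F_GETFL */
	off_t pos;		/* current file position */
};

/* Record the state needed to reopen @fd later; returns 0 or -1. */
int fd_record_save(int fd, const char *path, struct fd_record *rec)
{
	assert(path != NULL && rec != NULL);
	if (strlen(path) >= sizeof(rec->path))
		return -1;
	strcpy(rec->path, path);

	rec->flags = fcntl(fd, F_GETFL);
	if (rec->flags < 0)
		return -1;

	rec->pos = lseek(fd, 0, SEEK_CUR);
	if (rec->pos == (off_t)-1)
		return -1;
	return 0;
}

/* Reopen the file and seek to the saved position; returns fd or -1. */
int fd_record_restore(const struct fd_record *rec)
{
	/* Never create or truncate on restore: the file must already exist. */
	int fd = open(rec->path, rec->flags & ~(O_CREAT | O_EXCL | O_TRUNC));

	if (fd < 0)
		return -1;
	if (lseek(fd, rec->pos, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The restore step only works if the file is still present with the same
contents, which is exactly why the file system state has to be preserved
by other means (freezing, snapshotting, rsync).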
Oren.
---
Here is the introduction to the original post:
---
Following up on the thread on the checkpoint-restart patch set
(http://lkml.org/lkml/2010/3/1/422), the following series is the
latest checkpoint/restart, based on 2.6.33.
The first 20 patches are cleanups and preparation for c/r; they
are followed by the actual c/r code.
Please apply to -mm, and let us know if there is any way we can
help.
---
Linux Checkpoint-Restart:
web, wiki: http://www.linux-cr.org
bug track: https://www.linux-cr.org/redmine
The repositories for the project are in:
kernel: http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
user tools: http://www.linux-cr.org/git/?p=user-cr.git;a=summary
test suite: http://www.linux-cr.org/git/?p=tests-cr.git;a=summary
---
CHANGELOG:
[2010-Mar-16] v20
BUG FIXES (only)
- [Serge Hallyn] Fix unlabeled restore case
- [Serge Hallyn] Always restore msg_msg label
- [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
- [Serge Hallyn] save_access_regs for self-checkpoint
- [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
- Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
- Cleanup: no need to restore perm->{id,key,seq}
- Fix sysvipc=n compile
- Make uts_ns=n compile
- Only use arch_setup_additional_pages() if supported by arch
- Export key symbols to enable c/r from kernel modules
- Avoid crash if incoming object doesn't have .restore
- Replace error_sem with an event completion
- [Serge Hallyn] Change sysctl and default for unprivileged use
- [Nathan Lynch] Use syscall_get_error
- Add entry for checkpoint/restart in MAINTAINERS
[2010-Feb-19] v19
NEW FEATURES
- Support for x86-64 architecture
- Support for c/r of LSM (smack, selinux)
- Support for c/r of task fs_root and pwd
- Support for c/r of epoll
- Support for c/r of eventfd
- Enable C/R while executing over NFS
- Preliminary c/r of mounts namespace
- Add @logfd argument to sys_{checkpoint,restart} prototypes
- Define new api for error and debug logging
- Restart to handle checkpoint images lacking {uts,ipc}-ns
- Refuse to checkpoint if monitoring directories with dnotify
- Refuse to checkpoint if file locks and leases are held
- Refuse to checkpoint files with f_owner
OTHER CHANGES
- Rebase to kernel 2.6.33-rc8
- Settled version of new sys_eclone()
- [Serge Hallyn] Fix potential use-before-set return (vdso)
- Update documentation and examples for new syscalls API (doc)
- [Liu Alexander] Fix typos (doc)
- [Serge Hallyn] Update checkpoint image format (doc)
- [Serge Hallyn] Use ckpt_err() for bad header values
- sys_{checkpoint,restart} to use ptregs prototype
- Set ctx->errno in do_ckpt_msg() if needed
- Fix up headers so we can munge them for use by userspace
- Multiple fixes to _ckpt_write_err() and friends
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Add global section container to image format
- [Matt Helsley] Fix total byte read/write count for large images
- ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
- [Serge Hallyn] Use ckpt_err() for arch incompatibilities
- Introduce walk_task_subtree() to iterate through descendants
- Call restore_notify_error for restart (not checkpoint !)
- Make kread/kwrite() abort if CKPT_CTX_ERROR is set
- [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc
- Simplify logic of tracking restarting tasks (->ctx)
- Coordinator kills descendants on failure for proper cleanup
- Prepare descendants needs PTRACE_MODE_ATTACH permissions
- Threads wait for entire thread group before restoring
- Add debug process-tree status during restart
- Fix handling of bogus pid arg to sys_restart
- In reparent_thread() test for PF_RESTARTING on parent
- Keep __u32s in even groups for 32-64 bit compatibility
- Define ckpt_obj_try_fetch
- Disallow zero or negative objref during restart
- Check for valid destructor before calling it (deferqueue)
- Fix false negative of test for unlinked files at checkpoint
- [Serge Hallyn] Rename fs_mnt to root_fs_path
- Restore thread/cpu state early
- Ensure null-termination of file names read from image
- Fix compile warning in restore_open_fname()
- Introduce FOLL_DIRTY to follow_page() for "dirty" pages
- [Serge Hallyn] Checkpoint saved_auxv as u64s
- Export filemap_checkpoint()
- [Serge Hallyn] Disallow checkpoint of tasks with aio requests
- Fix compilation failure when !CONFIG_CHECKPOINT (regression)
- Expose page write functions
- Do not hold mmap_sem while checkpointing vma's
- Do not hold mmap_sem when reading memory pages on restart
- Move consider_private_page() to mm/memory.c:__get_dirty_page()
- [Serge Hallyn] move destroy_mm into mmap.c and remove size check
- [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
- [Serge Hallyn] Fix return value of read_pages_contents()
- [Serge Hallyn] Change m_type to long, not int (ipc)
- Don't free sma if it's an error on restore
- Use task->saved_sigmask and drop task->checkpoint_data
- [Serge Hallyn] Handle saved_sigmask at checkpoint
- Defer restore of blocked signals mask during restart
- Self-restart to tolerate missing PGIDs
- [Serge Hallyn] skb->tail can be offset
- Export and leverage sock_alloc_file()
- [Nathan Lynch] Fix net/checkpoint.c for 64-bit
- [Dan Smith] Unify skb read/write functions and handle fragmented buffers
- [Dan Smith] Update buffer restore code to match the new format
- [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
- [Dan Smith] Remove an unnecessary check on socket restart
- [Dan Smith] Pass the stored sock->protocol into sock_create() on restore
- Relax tcp.window_clamp value in INET restore
- Restore gso_type fields on sockets and buffers for proper operation
- Fix broken compilation for no-c/r architectures
- Return -EBUSY (not BUG_ON) if fd is gone on restart
- Fix the chunk size instead of auto-tune (epoll)
ARCH: x86 (32,64)
- Use PTREGSCALL4 for sys_{checkpoint,restart}
- Remove debug-reg support (need to redo with perf_events)
- [Serge Hallyn] Support for ia32 (checkpoint, restart)
- Split arch/x86/checkpoint.c to generic and 32bit specific parts
- sys_{checkpoint,restart} to use ptregs
- Allow X86_EFLAGS_RF on restart
- [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
- Move checkpoint.c from arch/x86/mm->arch/x86/kernel
ARCH: s390 [Serge Hallyn]
- Define s390x sys_restart wrapper
- Fixes to restart-blocks logic and signal path
- Fix checkpoint and restart compat wrappers
- sys_{checkpoint,restart} to use ptregs
- Use simpler test_task_thread to test current ti flags
- Fix 31-bit s390 checkpoint/restart wrappers
- Update sys_checkpoint (do_sys_checkpoint on all archs)
- [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel
ARCH: powerpc [Nathan Lynch]
- [Serge Hallyn] Add hook task_has_saved_sigmask()
- Warn if full register state unavailable
- Fix up checkpoint syscall, tidy restart
- [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel}
[2009-Sep-22] v18
NEW FEATURES
- [Nathan Lynch] Re-introduce powerpc support
- Save/restore pseudo-terminals
- Save/restore (pty) controlling terminals
- Save/restore PGIDs
- [Dan Smith] Save/restore unix domain sockets
- Save/restore FIFOs
- Save/restore pending signals
- Save/restore rlimits
- Save/restore itimers
- [Matt Helsley] Handle many non-pseudo file-systems
OTHER CHANGES
- Rename headerless struct ckpt_hdr_* to struct ckpt_*
- [Nathan Lynch] discard const from struct cred * where appropriate
- [Serge Hallyn][s390] Set return value for self-checkpoint
- Handle kmalloc failure in restore_sem_array()
- [IPC] Collect files used by shm objects
- [IPC] Use file (not inode) as shared object on checkpoint of shm
- More ckpt_write_err()s to give information on checkpoint failure
- Adjust format of pipe buffer to include the mandatory pre-header
- [LEAKS] Mark the backing file as visited at checkpoint
- Tighten checks on supported vma to checkpoint or restart
- [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
- Introduce ckpt_collect_file() that also uses file->collect method
- Use ckpt_collect_file() instead of ckpt_obj_collect() for files
- Fix leak-detection issue in collect_mm() (test for first-time obj)
- Invoke set_close_on_exec() unconditionally on restart
- [Dan Smith] Export fill_fname() as ckpt_fill_fname()
- Interface to pass simple pointers as data with deferqueue
- [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
- Replace EAGAIN with EBUSY where necessary
- Introduce CKPT_OBJ_VISITED in leak detection
- ckpt_obj_collect() returns objref for new objects, 0 otherwise
- Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
- Introduce ckpt_obj_visit() to mark objects as visited
- Set the CHECKPOINTED flag on objects before calling checkpoint
- Introduce ckpt_obj_reserve()
- Change ref_drop() to accept a @lastref argument (for cleanup)
- Disallow multiple objects with same objref in restart
- Allow _ckpt_read_obj_type() to read header only (w/o payload)
- Fix leak of ckpt_ctx when restoring zombie tasks
- Fix race of prepare_descendant() with an ongoing fork()
- Track and report the first error if restart fails
- Tighten logic to protect against bogus pids in input
- [Matt Helsley] Improve debug output from ckpt_notify_error()
- [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
- Detect error-headers in input data on restart, and abort.
- Standard format for checkpoint error strings (and documentation)
- [Dan Smith] Add an errno validation function
- Add ckpt_read_payload(): read a variable-length object (no header)
- Add ckpt_read_string(): same for strings (ensures null-terminated)
- Add ckpt_read_consume(): consumes next object without processing
- [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
[2009-Jul-21] v17
- Introduce syscall clone_with_pids() to restore original pids
- Support threads and zombies
- Save/restore task->files
- Save/restore task->sighand
- Save/restore futex
- Save/restore credentials
- Introduce PF_RESTARTING to skip notifications on task exit
- restart(2) allow caller to ask to freeze tasks after restart
- restart(2) isn't idempotent: return -EINTR if interrupted
- Improve debugging output handling
- Make multi-process restart logic more robust and complete
- Correctly select return value for restarting tasks on success
- Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for frozen checkpointed tasks
- Fix compilation without CONFIG_CHECKPOINT
- Fix compilation with CONFIG_COMPAT
- Fix headers includes and exports
- Leak detection performed in two steps
- Detect "inverse" leaks of objects (dis)appearing unexpectedly
- Memory: save/restore mm->{flags,def_flags,saved_auxv}
- Memory: only collect sub-objects of mm once (leak detection)
- Files: validate f_mode after restore
- Namespaces: leak detection for nsproxy sub-components
- Namespaces: proper restart from namespace(s) without namespace(s)
- Save global constants in header instead of per-object
- IPC: replace sys_unshare() with create_ipc_ns()
- IPC: restore objects in suitable namespace
- IPC: correct behavior under !CONFIG_IPC_NS
- UTS: save/restore all fields
- UTS: replace sys_unshare() with create_uts_ns()
- X86_32: sanitize cpu, debug, and segment registers on restart
- cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
- cgroup_freezer: add interface to freeze a cgroup (given a task)
[2009-May-27] v16
- Privilege checks for IPC checkpoint
- Fix error string generation during checkpoint
- Use kzalloc for header allocation
- Restart blocks are arch-independent
- Redo pipe c/r using splice
- Fixes to s390 arch
- Remove powerpc arch (temporary)
- Explicitly restore ->nsproxy
- All objects in image are preceded by 'struct ckpt_hdr'
- Fix leaks detection (and leaks)
- Reorder of patchset
- Misc bugs and compilation fixes
[2009-Apr-12] v15
- Minor fixes
[2009-Apr-28] v14
- Tested against kernel v2.6.30-rc3 on x86_32.
- Refactor files checkpoint to use f_ops (file operations)
- Refactor mm/vma to use vma_ops
- Explicitly handle VDSO vma (and require compat mode)
- Added code to c/r restart-blocks (restart timeout related syscalls)
- Added code to c/r namespaces: uts, ipc (with Dan Smith)
- Added code to c/r sysvipc (shm, msg, sem)
- Support for VM_CLONE shared memory
- Added resource leak detection for whole-container checkpoint
- Added sysctl gauge to allow unprivileged restart/checkpoint
- Improve and simplify the code and logic of shared objects
- Rework image format: shared objects appear prior to their use
- Merge checkpoint and restart functionality into same files
- Massive renaming of functions: prefix "ckpt_" for generics,
"checkpoint_" for checkpoint, and "restore_" for restart.
- Report checkpoint errors as a valid (string record) in the output
- Merged PPC architecture (by Nathan Lynch)
- Requires updates to userspace tools too.
- Misc nits and bug fixes
[2009-Mar-31] v14-rc2
- Change along Dave's suggestion to use f_ops->checkpoint() for files
- Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
- Merge support for PPC arch (Nathan Lynch)
- Misc cleanups and fixes in response to comments
[2009-Mar-20] v14-rc1:
- The 'h.parent' field of 'struct cr_hdr' isn't used - discard
- Check whether calls to cr_hbuf_get() succeed or fail.
- Fixes to pipe c/r code
- Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
- Refuse non-self checkpoint if a task isn't frozen
- Use unsigned fields in checkpoint headers unless otherwise required
- Rename functions in files c/r to better reflect their role
- Add support for anonymous shared memory
- Merge support for s390 arch (Dan Smith, Serge Hallyn)
[2008-Dec-03] v13:
- Cleanups of 'struct cr_ctx' - remove unused fields
- Misc fixes for comments
[2008-Dec-17] v12:
- Fix re-alloc/reset of pgarr chain to correctly reuse buffers
(empty pgarr are saved in a separate pool chain)
- Add a couple of missed calls to cr_hbuf_put()
- cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
- Split cr_write/cr_read() to two parts: _cr_write/read() helper
- Befriend with sparse: explicit conversion to 'void __user *'
- Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
[2008-Dec-05] v11:
- Use contents of 'init->fs->root' instead of pointing to it
- Ignore symlinks (there is no such thing as an open symlink)
- cr_scan_fds() retries from scratch if it hits size limits
- Add missing test for VM_MAYSHARE when dumping memory
- Improve documentation about: behavior when tasks aren't frozen,
life span of the object hash, references to objects in the hash
[2008-Nov-26] v10:
- Grab vfs root of container init, rather than current process
- Acquire dcache_lock around call to __d_path() in cr_fill_name()
- Force end-of-string in cr_read_string() (fix possible DoS)
- Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
[2008-Nov-10] v9:
- Support multiple processes c/r
- Extend checkpoint header with architecture-dependent header
- Misc bug fixes (see individual changelogs)
- Rebase to v2.6.28-rc3.
[2008-Oct-29] v8:
- Support "external" checkpoint
- Include Dave Hansen's 'deny-checkpoint' patch
- Split docs in Documentation/checkpoint/..., and improve contents
[2008-Oct-17] v7:
- Fix save/restore state of FPU
- Fix argument given to kunmap_atomic() in memory dump/restore
[2008-Oct-07] v6:
- Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
(even though it's not really needed)
- Add assumptions and what's-missing to documentation
- Misc fixes and cleanups
[2008-Sep-11] v5:
- Config is now 'def_bool n' by default
- Improve memory dump/restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
- Remove preempt_disable() when restoring debug registers
- Rename headers files s/ckpt/checkpoint/
- Fix misc bugs in files dump/restore
- Fixes and cleanups on some error paths
- Fix misc coding style
[2008-Sep-09] v4:
- Various fixes and clean-ups
- Fix calculation of hash table size
- Fix header structure alignment
- Use standard list_... for cr_pgarr
[2008-Aug-29] v3:
- Various fixes and clean-ups
- Use standard hlist_... for hash table
- Better use of standard kmalloc/kfree
[2008-Aug-20] v2:
- Added Dump and restore of open files (regular and directories)
- Added basic handling of shared objects, and improve handling of
'parent tag' concept
- Added documentation
- Improved ABI, 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAP_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup and response to comments for v1
[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.
^ permalink raw reply [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
[not found] ` <1268960401-16680-2-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-22 6:31 ` Nick Piggin
2010-03-19 0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
` (16 subsequent siblings)
17 siblings, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

These two are used in the next patch when calling vfs_read/write()

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/read_write.c    |   10 ----------
 include/linux/fs.h |   10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b7f4a1f..e258301 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ebb1cd5..6c08df2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 			      struct iovec *fast_pointer,
 			      struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.6.3.3
^ permalink raw reply related [flat|nested] 88+ messages in thread
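The two helpers moved by this patch support a common calling pattern:
copy the file position into a local variable, hand a pointer to that
local to the read/write routine, then store the updated value back.
Below is a minimal userspace sketch of that pattern. struct mock_file,
mock_vfs_read() and mock_read() are hypothetical stand-ins invented for
this illustration (the real helpers operate on the kernel's struct file
and use loff_t); only the shape of the read/update/write-back sequence
is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct mock_file {
	const char *data;	/* backing bytes */
	size_t      size;	/* number of bytes in @data */
	off_t       f_pos;	/* current position, as in file->f_pos */
};

/* Same shape as the helpers in the patch, but on the mock type. */
static inline off_t file_pos_read(struct mock_file *file)
{
	return file->f_pos;
}

static inline void file_pos_write(struct mock_file *file, off_t pos)
{
	file->f_pos = pos;
}

/*
 * Mock of a vfs_read()-style routine: it reads at *pos and advances
 * *pos, but never touches file->f_pos itself.
 */
ssize_t mock_vfs_read(struct mock_file *file, char *buf, size_t count,
		      off_t *pos)
{
	size_t avail;

	assert(*pos >= 0);
	if ((size_t)*pos >= file->size)
		return 0;			/* end of file */
	avail = file->size - (size_t)*pos;
	if (count > avail)
		count = avail;
	memcpy(buf, file->data + *pos, count);
	*pos += (off_t)count;
	return (ssize_t)count;
}

/* The calling pattern: read position, do I/O, write position back. */
ssize_t mock_read(struct mock_file *file, char *buf, size_t count)
{
	off_t pos = file_pos_read(file);
	ssize_t ret = mock_vfs_read(file, buf, count, &pos);

	file_pos_write(file, pos);
	return ret;
}
```

Keeping the position update outside the I/O routine is what lets a
caller other than the read(2)/write(2) syscalls reuse the same sequence.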
* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
[not found] ` <1268960401-16680-2-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-22 6:31 ` Nick Piggin
2010-03-23 0:12 ` Oren Laadan
2010-03-23 0:12 ` Oren Laadan
1 sibling, 2 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 6:31 UTC (permalink / raw)
To: Oren Laadan; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger

On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> These two are used in the next patch when calling vfs_read/write()

Said next patch didn't seem to make it to fsdevel.

Should it at least go to fs/internal.h?

>
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> ---
>  fs/read_write.c    |   10 ----------
>  include/linux/fs.h |   10 ++++++++++
>  2 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b7f4a1f..e258301 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
>
>  EXPORT_SYMBOL(vfs_write);
>
> -static inline loff_t file_pos_read(struct file *file)
> -{
> -	return file->f_pos;
> -}
> -
> -static inline void file_pos_write(struct file *file, loff_t pos)
> -{
> -	file->f_pos = pos;
> -}
> -
>  SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
>  {
>  	struct file *file;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ebb1cd5..6c08df2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
>  			      struct iovec *fast_pointer,
>  			      struct iovec **ret_pointer);
>
> +static inline loff_t file_pos_read(struct file *file)
> +{
> +	return file->f_pos;
> +}
> +
> +static inline void file_pos_write(struct file *file, loff_t pos)
> +{
> +	file->f_pos = pos;
> +}
> +
>  extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
>  extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
>  extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> --
> 1.6.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
2010-03-22 6:31 ` Nick Piggin
@ 2010-03-23 0:12 ` Oren Laadan
2010-03-23 0:43 ` Nick Piggin
[not found] ` <Pine.LNX.4.64.1003221959450.1520-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2010-03-23 0:12 ` Oren Laadan
1 sibling, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-23 0:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger

On Mon, 22 Mar 2010, Nick Piggin wrote:

> On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> > These two are used in the next patch when calling vfs_read/write()
>
> Said next patch didn't seem to make it to fsdevel.

Thanks for reviewing, and sorry about this glitch - see below.

>
> Should it at least go to fs/internal.h?

Sure.

So here is the relevant hunk from said patch (the entire
patch is: https://patchwork.kernel.org/patch/86389/):

+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE; /* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file , addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}

Oren.
^ permalink raw reply [flat|nested] 88+ messages in thread
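For readers unfamiliar with the loop structure of the helpers quoted
above, here is a rough userspace analogue: keep calling write(2) or
read(2) until the whole buffer has been transferred, retry on EAGAIN,
and treat end-of-file in the middle of a read as an error (-EPIPE, as
in _ckpt_kread()). The names write_all() and read_full() are
hypothetical, and retrying on EINTR is an addition of this sketch that
the quoted kernel code does not have. Like the original, the EAGAIN
retry simply spins; a real program would normally poll instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly @count bytes; returns 0 on success, -errno on failure. */
int write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;

	assert(buf != NULL || count == 0);
	while (count > 0) {
		ssize_t n = write(fd, p, count);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;	/* nothing written, try again */
			return -errno;
		}
		p += n;				/* short write: advance and loop */
		count -= (size_t)n;
	}
	return 0;
}

/* Read exactly @count bytes; returns 0, -errno, or -EPIPE on early EOF. */
int read_full(int fd, void *buf, size_t count)
{
	char *p = buf;

	assert(buf != NULL || count == 0);
	while (count > 0) {
		ssize_t n = read(fd, p, count);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EPIPE;		/* unexpected end of file */
		p += n;
		count -= (size_t)n;
	}
	return 0;
}
```

A caller would typically pair the two across a pipe or socket, for
example write_all(fds[1], hdr, sizeof(hdr)) on one side and
read_full(fds[0], hdr, sizeof(hdr)) on the other.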
* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
2010-03-23 0:12 ` Oren Laadan
@ 2010-03-23 0:43 ` Nick Piggin
2010-03-23 0:56 ` Oren Laadan
2010-03-23 0:56 ` Oren Laadan
[not found] ` <Pine.LNX.4.64.1003221959450.1520-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
1 sibling, 2 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-23 0:43 UTC (permalink / raw)
To: Oren Laadan; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger

On Mon, Mar 22, 2010 at 08:12:45PM -0400, Oren Laadan wrote:
> On Mon, 22 Mar 2010, Nick Piggin wrote:
>
> > On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> > > These two are used in the next patch when calling vfs_read/write()
> >
> > Said next patch didn't seem to make it to fsdevel.
>
> Thanks for reviewing, and sorry about this glitch - see below.
>
> >
> > Should it at least go to fs/internal.h?
>
> Sure.
>
> So here is the relevant hunk from said patch (the entire
> patch is: https://patchwork.kernel.org/patch/86389/):
>
> +/*
> + * Helpers to write(read) from(to) kernel space to(from) the checkpoint
> + * image file descriptor (similar to how a core-dump is performed).
> + *
> + *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
> + *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer

Hmm, OK. Slightly-more-write(2) type of write.

fs/splice.c code also has a kernel_write and readv. Not sure if there is
any other common code. But maybe it would be better to put together some
useful helpers under fs/ rather than a ckpt specific thing.

> + */
> +
> +static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
> +{
> +	void __user *uaddr = (__force void __user *) addr;
> +	ssize_t nwrite;
> +	int nleft;
> +
> +	for (nleft = count; nleft; nleft -= nwrite) {
> +		loff_t pos = file_pos_read(file);
> +		nwrite = vfs_write(file, uaddr, nleft, &pos);
> +		file_pos_write(file, pos);
> +		if (nwrite < 0) {
> +			if (nwrite == -EAGAIN)
> +				nwrite = 0;
> +			else
> +				return nwrite;
> +		}
> +		uaddr += nwrite;
> +	}
> +	return 0;
> +}
> +
> +int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
> +{
> +	mm_segment_t fs;
> +	int ret;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = _ckpt_kwrite(ctx->file, addr, count);
> +	set_fs(fs);
> +
> +	ctx->total += count;
> +	return ret;
> +}
> +
> +static inline int _ckpt_kread(struct file *file, void *addr, int count)
> +{
> +	void __user *uaddr = (__force void __user *) addr;
> +	ssize_t nread;
> +	int nleft;
> +
> +	for (nleft = count; nleft; nleft -= nread) {
> +		loff_t pos = file_pos_read(file);
> +		nread = vfs_read(file, uaddr, nleft, &pos);
> +		file_pos_write(file, pos);
> +		if (nread <= 0) {
> +			if (nread == -EAGAIN) {
> +				nread = 0;
> +				continue;
> +			} else if (nread == 0)
> +				nread = -EPIPE; /* unexpected EOF */
> +			return nread;
> +		}
> +		uaddr += nread;
> +	}
> +	return 0;
> +}
> +
> +int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
> +{
> +	mm_segment_t fs;
> +	int ret;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = _ckpt_kread(ctx->file , addr, count);
> +	set_fs(fs);
> +
> +	ctx->total += count;
> +	return ret;
> +}
>
> Oren.
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public 2010-03-23 0:43 ` Nick Piggin @ 2010-03-23 0:56 ` Oren Laadan 2010-03-23 0:56 ` Oren Laadan 1 sibling, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-23 0:56 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger Nick Piggin wrote: > On Mon, Mar 22, 2010 at 08:12:45PM -0400, Oren Laadan wrote: >> On Mon, 22 Mar 2010, Nick Piggin wrote: >> >>> On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote: >>>> These two are used in the next patch when calling vfs_read/write() >>> Said next patch didn't seem to make it to fsdevel. >> Thanks for reviewing, and sorry about this glitch - see below. >> >>> Should it at least go to fs/internal.h? >> Sure. >> >> So Here is the relevant hunk from said patch (the entire >> patch is: https://patchwork.kernel.org/patch/86389/): >> >> +/* >> + * Helpers to write(read) from(to) kernel space to(from) the checkpoint >> + * image file descriptor (similar to how a core-dump is performed). >> + * >> + * ckpt_kwrite() - write a kernel-space buffer to the checkpoint image >> + * ckpt_kread() - read from the checkpoint image to a kernel-space buffer > > Hmm, OK. Slightly-more-write(2) type of write. > > fs/splice.c code also has a kernel_write and readv. Not sure if there is > any other common code. But maybe it would be better to put together some > useful helpers under fs/ rather than a ckpt specific thing. Right. Another place is fs/exec.c that provides kernel_read(). I'll put the common code in kernel/read_write.c then. Oren. 
> >> + */ >> + >> +static inline int _ckpt_kwrite(struct file *file, void *addr, int count) >> +{ >> + void __user *uaddr = (__force void __user *) addr; >> + ssize_t nwrite; >> + int nleft; >> + >> + for (nleft = count; nleft; nleft -= nwrite) { >> + loff_t pos = file_pos_read(file); >> + nwrite = vfs_write(file, uaddr, nleft, &pos); >> + file_pos_write(file, pos); >> + if (nwrite < 0) { >> + if (nwrite == -EAGAIN) >> + nwrite = 0; >> + else >> + return nwrite; >> + } >> + uaddr += nwrite; >> + } >> + return 0; >> +} >> + >> +int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count) >> +{ >> + mm_segment_t fs; >> + int ret; >> + >> + fs = get_fs(); >> + set_fs(KERNEL_DS); >> + ret = _ckpt_kwrite(ctx->file, addr, count); >> + set_fs(fs); >> + >> + ctx->total += count; >> + return ret; >> +} >> + >> +static inline int _ckpt_kread(struct file *file, void *addr, int count) >> +{ >> + void __user *uaddr = (__force void __user *) addr; >> + ssize_t nread; >> + int nleft; >> + >> + for (nleft = count; nleft; nleft -= nread) { >> + loff_t pos = file_pos_read(file); >> + nread = vfs_read(file, uaddr, nleft, &pos); >> + file_pos_write(file, pos); >> + if (nread <= 0) { >> + if (nread == -EAGAIN) { >> + nread = 0; >> + continue; >> + } else if (nread == 0) >> + nread = -EPIPE; /* unexecpted EOF */ >> + return nread; >> + } >> + uaddr += nread; >> + } >> + return 0; >> +} >> + >> +int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count) >> +{ >> + mm_segment_t fs; >> + int ret; >> + >> + fs = get_fs(); >> + set_fs(KERNEL_DS); >> + ret = _ckpt_kread(ctx->file , addr, count); >> + set_fs(fs); >> + >> + ctx->total += count; >> + return ret; >> +} >> >> Oren. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
2010-03-22 6:31 ` Nick Piggin
2010-03-23 0:12 ` Oren Laadan
@ 2010-03-23 0:12 ` Oren Laadan
1 sibling, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-23 0:12 UTC (permalink / raw)
To: Nick Piggin
Cc: linux-fsdevel, containers, Andreas Dilger

On Mon, 22 Mar 2010, Nick Piggin wrote:

> On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> > These two are used in the next patch when calling vfs_read/write()
>
> Said next patch didn't seem to make it to fsdevel.

Thanks for reviewing, and sorry about this glitch - see below.

>
> Should it at least go to fs/internal.h?

Sure.

So here is the relevant hunk from said patch (the entire
patch is: https://patchwork.kernel.org/patch/86389/):

+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ * ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ * ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE; /* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}

Oren.

^ permalink raw reply [flat|nested] 88+ messages in thread
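The retry loops in _ckpt_kwrite() and _ckpt_kread() above follow a familiar "transfer everything or fail" pattern. As a rough illustration only, here is a minimal userspace sketch of the same idea built on write(2) and read(2). It is not the kernel code: it relies on the descriptor's implicit offset instead of file_pos_read()/file_pos_write(), it needs no set_fs(KERNEL_DS), and the helper names write_fully() and read_fully() are hypothetical. Retrying on EINTR is an addition of this sketch; the posted hunk only retries on -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical userspace analogue of _ckpt_kwrite(): keep calling
 * write(2) until all `count` bytes are out. Returns 0 on success or a
 * negative errno value, mirroring the kernel helper's convention.
 */
static int write_fully(int fd, const void *buf, size_t count)
{
	const char *p = buf;
	size_t nleft = count;

	assert(buf != NULL || count == 0);

	while (nleft > 0) {
		ssize_t n = write(fd, p, nleft);

		if (n < 0) {
			/* Nothing was written; try again (EINTR is our addition). */
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		p += n;
		nleft -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical analogue of _ckpt_kread(): a read that hits end-of-file
 * before `count` bytes arrived is reported as -EPIPE, as in the hunk.
 */
static int read_fully(int fd, void *buf, size_t count)
{
	char *p = buf;
	size_t nleft = count;

	assert(buf != NULL || count == 0);

	while (nleft > 0) {
		ssize_t n = read(fd, p, nleft);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EPIPE;	/* unexpected EOF */
		p += n;
		nleft -= (size_t)n;
	}
	return 0;
}
```

Like the posted loops, this spins if a non-blocking descriptor keeps returning EAGAIN; a caller that cares would presumably wait for readiness before retrying.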
* [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
[not found] ` <1268960401-16680-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-22 6:34 ` Nick Piggin
2010-03-19 0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
` (15 subsequent siblings)
17 siblings, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.

Also adds a new 'file_operations' function for 'collecting' a file for
leak-detection during full-container checkpoint. This is useful for
those files that hold references to other "collectable" objects. Two
examples are pty files that point to corresponding tty objects, and
eventpoll files that refer to the files they are monitoring.

Finally, this patch introduces vfs_fcntl() so that it can be called
from restart (see patch adding restart of files).

Changelog[v17]
  - Introduce 'collect' method
Changelog[v17]
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/fcntl.c         |   21 +++++++++++++--------
 include/linux/fs.h |    7 +++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 97e01dc..e1f02ca 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }

+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {
 	struct file *filp;
@@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;

-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
 	fput(filp);
 out:
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c08df2..65ebec5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -394,6 +394,7 @@ struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
 struct cred;
+struct ckpt_ctx;

 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1093,6 +1094,8 @@ struct file_lock {

 #include <linux/fcntl.h>

+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);

 #ifdef CONFIG_FILE_LOCKING
@@ -1504,6 +1507,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
 };

 struct inode_operations {
@@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);

+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);

 extern int vfs_stat(char __user *, struct kstat *);
--
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() 2010-03-19 0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan [not found] ` <1268960401-16680-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2010-03-22 6:34 ` Nick Piggin 2010-03-22 10:16 ` Matt Helsley 2010-03-22 10:16 ` Matt Helsley 1 sibling, 2 replies; 88+ messages in thread From: Nick Piggin @ 2010-03-22 6:34 UTC (permalink / raw) To: Oren Laadan; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger On Thu, Mar 18, 2010 at 08:59:46PM -0400, Oren Laadan wrote: > While we assume all normal files and directories can be checkpointed, > there are, as usual in the VFS, specialized places that will always > need an ability to override these defaults. Although we could do this > completely in the checkpoint code, that would bitrot quickly. > > This adds a new 'file_operations' function for checkpointing a file. > It is assumed that there should be a dirt-simple way to make something > (un)checkpointable that fits in with current code. > > As you can see in the ext[234] patches down the road, all that we have > to do to make something simple be supported is add a single "generic" > f_op entry. > > Also adds a new 'file_operations' function for 'collecting' a file for > leak-detection during full-container checkpoint. This is useful for > those files that hold references to other "collectable" objects. Two > examples are pty files that point to corresponding tty objects, and > eventpoll files that refer to the files they are monitoring. > > Finally, this patch introduces vfs_fcntl() so that it can be called > from restart (see patch adding restart of files). > > Changelog[v17] > - Introduce 'collect' method > Changelog[v17] > - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h > > Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> > Acked-by: Serge E. 
Hallyn <serue@us.ibm.com> > Tested-by: Serge E. Hallyn <serue@us.ibm.com> > --- > fs/fcntl.c | 21 +++++++++++++-------- > include/linux/fs.h | 7 +++++++ > 2 files changed, 20 insertions(+), 8 deletions(-) > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index 97e01dc..e1f02ca 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > return err; > } > > +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) > +{ > + int err; > + > + err = security_file_fcntl(filp, cmd, arg); > + if (err) > + goto out; > + err = do_fcntl(fd, cmd, arg, filp); > + out: > + return err; > +} > + > SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) > { > struct file *filp; > @@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) > if (!filp) > goto out; > > - err = security_file_fcntl(filp, cmd, arg); > - if (err) { > - fput(filp); > - return err; > - } > - > - err = do_fcntl(fd, cmd, arg, filp); > - > + err = vfs_fcntl(fd, cmd, arg, filp); > fput(filp); > out: > return err; There is no point combining these two logically distinct patches. 
> diff --git a/include/linux/fs.h b/include/linux/fs.h > index 6c08df2..65ebec5 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -394,6 +394,7 @@ struct kstatfs; > struct vm_area_struct; > struct vfsmount; > struct cred; > +struct ckpt_ctx; > > extern void __init inode_init(void); > extern void __init inode_init_early(void); > @@ -1093,6 +1094,8 @@ struct file_lock { > > #include <linux/fcntl.h> > > +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp); > + > extern void send_sigio(struct fown_struct *fown, int fd, int band); > > #ifdef CONFIG_FILE_LOCKING > @@ -1504,6 +1507,8 @@ struct file_operations { > ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); > ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); > int (*setlease)(struct file *, long, struct file_lock **); > + int (*checkpoint)(struct ckpt_ctx *, struct file *); > + int (*collect)(struct ckpt_ctx *, struct file *); > }; > > struct inode_operations { You didn't add any documentation for this (unless it is in a following patch, which it shouldn't be). > @@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); > loff_t inode_get_bytes(struct inode *inode); > void inode_set_bytes(struct inode *inode, loff_t bytes); > > +#define generic_file_checkpoint NULL > + > extern int vfs_readdir(struct file *, filldir_t, void *); > > extern int vfs_stat(char __user *, struct kstat *); Hmm, what does generic_file_checkpoint mean? A NULL checkpoint op means that checkpointing is allowed, and no action is required? Shouldn't it be an opt-in operation, where NULL means not allowed? Either way, I don't know if you need to have this #define, provided you have sufficient documentation. ^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() 2010-03-22 6:34 ` Nick Piggin @ 2010-03-22 10:16 ` Matt Helsley 2010-03-22 10:16 ` Matt Helsley 1 sibling, 0 replies; 88+ messages in thread From: Matt Helsley @ 2010-03-22 10:16 UTC (permalink / raw) To: Nick Piggin Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger On Mon, Mar 22, 2010 at 05:34:28PM +1100, Nick Piggin wrote: > On Thu, Mar 18, 2010 at 08:59:46PM -0400, Oren Laadan wrote: > > While we assume all normal files and directories can be checkpointed, > > there are, as usual in the VFS, specialized places that will always > > need an ability to override these defaults. Although we could do this > > completely in the checkpoint code, that would bitrot quickly. > > > > This adds a new 'file_operations' function for checkpointing a file. > > It is assumed that there should be a dirt-simple way to make something > > (un)checkpointable that fits in with current code. > > > > As you can see in the ext[234] patches down the road, all that we have > > to do to make something simple be supported is add a single "generic" > > f_op entry. > > > > Also adds a new 'file_operations' function for 'collecting' a file for > > leak-detection during full-container checkpoint. This is useful for > > those files that hold references to other "collectable" objects. Two > > examples are pty files that point to corresponding tty objects, and > > eventpoll files that refer to the files they are monitoring. > > > > Finally, this patch introduces vfs_fcntl() so that it can be called > > from restart (see patch adding restart of files). > > > > Changelog[v17] > > - Introduce 'collect' method > > Changelog[v17] > > - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h > > > > Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> > > Acked-by: Serge E. 
Hallyn <serue@us.ibm.com>
> > Tested-by: Serge E. Hallyn <serue@us.ibm.com>
> > ---
> >  fs/fcntl.c         |   21 +++++++++++++--------
> >  include/linux/fs.h |    7 +++++++
> >  2 files changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 97e01dc..e1f02ca 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> >  	return err;
> >  }
> >
> > +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
> > +{
> > +	int err;
> > +
> > +	err = security_file_fcntl(filp, cmd, arg);
> > +	if (err)
> > +		goto out;
> > +	err = do_fcntl(fd, cmd, arg, filp);
> > + out:
> > +	return err;
> > +}
> > +
> >  SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
> >  {
> >  	struct file *filp;
> > @@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
> >  	if (!filp)
> >  		goto out;
> >
> > -	err = security_file_fcntl(filp, cmd, arg);
> > -	if (err) {
> > -		fput(filp);
> > -		return err;
> > -	}
> > -
> > -	err = do_fcntl(fd, cmd, arg, filp);
> > -
> > +	err = vfs_fcntl(fd, cmd, arg, filp);
> >  	fput(filp);
> >  out:
> >  	return err;
>
> There is no point combining these two logically distinct patches.

Good point.

> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 6c08df2..65ebec5 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -394,6 +394,7 @@ struct kstatfs;
> >  struct vm_area_struct;
> >  struct vfsmount;
> >  struct cred;
> > +struct ckpt_ctx;
> >
> >  extern void __init inode_init(void);
> >  extern void __init inode_init_early(void);
> > @@ -1093,6 +1094,8 @@ struct file_lock {
> >
> >  #include <linux/fcntl.h>
> >
> > +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
> > +
> >  extern void send_sigio(struct fown_struct *fown, int fd, int band);
> >
> >  #ifdef CONFIG_FILE_LOCKING
> > @@ -1504,6 +1507,8 @@ struct file_operations {
> >  	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
> >  	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
> >  	int (*setlease)(struct file *, long, struct file_lock **);
> > +	int (*checkpoint)(struct ckpt_ctx *, struct file *);
> > +	int (*collect)(struct ckpt_ctx *, struct file *);
> >  };
> >
> >  struct inode_operations {
>
> You didn't add any documentation for this (unless it is in a following
> patch, which it shouldn't be).

Another good point -- we should have added that to
Documentation/filesystems/vfs.txt

> >
> > @@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
> >  loff_t inode_get_bytes(struct inode *inode);
> >  void inode_set_bytes(struct inode *inode, loff_t bytes);
> >
> > +#define generic_file_checkpoint NULL
> > +
> >  extern int vfs_readdir(struct file *, filldir_t, void *);
> >
> >  extern int vfs_stat(char __user *, struct kstat *);
>
> Hmm, what does generic_file_checkpoint mean? A NULL checkpoint op means
> that checkpointing is allowed, and no action is required? Shouldn't it
> be an opt-in operation, where NULL means not allowed?

generic_file_checkpoint is for files that have a seek operation and can be
backed up or restored with a simple copy.

A NULL checkpoint op means "not allowed" as you thought it should. What
gave you the impression it was otherwise? Here's the relevant snippet
from checkpoint/files.c:

/* checkpoint callback for file pointer */
int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
{
	struct file *file = (struct file *) ptr;
	int ret;

	if (!file->f_op || !file->f_op->checkpoint) {
		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
			       file, file->f_op);
		return -EBADF;
	}

> Either way, I don't know if you need to have this #define, provided you
> have sufficient documentation.

We need it (or a suitable replacement) to avoid adding #ifdef around
assignments to the operation in every filesystem. It's used if
CONFIG_CHECKPOINT is not defined.

Thanks for the review.

Cheers,
	-Matt Helsley
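The opt-in convention described in the reply above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: `struct demo_file`, `struct demo_file_ops`, `demo_checkpoint_file()`, `demo_generic_checkpoint()` and the `DEMO_CHECKPOINT` switch are hypothetical stand-ins for `struct file`, `struct file_operations`, `checkpoint_file()`, `generic_file_checkpoint()` and `CONFIG_CHECKPOINT`.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_CHECKPOINT; build with -DDEMO_CHECKPOINT=0
 * to compile the feature out. */
#ifndef DEMO_CHECKPOINT
#define DEMO_CHECKPOINT 1
#endif

struct demo_file;

/* Stand-in for struct file_operations: the op is optional. */
struct demo_file_ops {
	int (*checkpoint)(struct demo_file *file);
};

struct demo_file {
	const struct demo_file_ops *f_op;
	int saved;	/* set once the file has been "checkpointed" */
};

#if DEMO_CHECKPOINT
/* Generic helper that simple file types can opt in to. */
static int demo_generic_checkpoint(struct demo_file *file)
{
	file->saved = 1;
	return 0;
}
#else
/* Feature compiled out: the name still exists, but it is just NULL, so
 * initializers that mention it need no #ifdef of their own. */
#define demo_generic_checkpoint NULL
#endif

/* A file type that opts in. */
static const struct demo_file_ops demo_regular_ops = {
	.checkpoint = demo_generic_checkpoint,
};

/* A file type that does not: a NULL op means "not allowed". */
static const struct demo_file_ops demo_special_ops = {
	.checkpoint = NULL,
};

/* Caller: refuse any file whose ops lack a checkpoint method. */
static int demo_checkpoint_file(struct demo_file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(file);
}
```

With the feature enabled, a file using `demo_regular_ops` is saved and one using `demo_special_ops` is refused with -EBADF. With `DEMO_CHECKPOINT` set to 0, both initializers collapse to NULL and every file is refused, which is the behavior the #define is meant to provide without touching each filesystem.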
* Re: [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
[not found] ` <20100322101635.GC20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 11:00 ` Nick Piggin
0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 11:00 UTC (permalink / raw)
To: Matt Helsley; +Cc: linux-fsdevel, containers, Andreas Dilger

On Mon, Mar 22, 2010 at 03:16:35AM -0700, Matt Helsley wrote:
> On Mon, Mar 22, 2010 at 05:34:28PM +1100, Nick Piggin wrote:
> > On Thu, Mar 18, 2010 at 08:59:46PM -0400, Oren Laadan wrote:
> > Hmm, what does generic_file_checkpoint mean? A NULL checkpoint op means
> > that checkpointing is allowed, and no action is required? Shouldn't it
> > be an opt-in operation, where NULL means not allowed?
>
> generic_file_checkpoint is for files that have a seek operation and can be
> backed up or restored with a simple copy.
>
> A NULL checkpoint op means "not allowed" as you thought it should. What
> gave you the impression it was otherwise? Here's the relevant snippet
> from checkpoint/files.c:

Right, I didn't check that far. It's just a bit strange to make it look
like filling in an aop function but it is actually still NULL.

>
> /* checkpoint callback for file pointer */
> int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
> {
> 	struct file *file = (struct file *) ptr;
> 	int ret;
>
> 	if (!file->f_op || !file->f_op->checkpoint) {
> 		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
> 			       file, file->f_op);
> 		return -EBADF;
> 	}
>
> > Either way, I don't know if you need to have this #define, provided you
> > have sufficient documentation.
>
> We need it (or a suitable replacement) to avoid adding #ifdef around
> assignments to the operation in every filesystem. It's used if
> CONFIG_CHECKPOINT is not defined.

If !CONFIG_CHECKPOINT, ->checkpoint should not exist and neither should
its callers.
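The alternative raised in the last sentence of this reply, where the op and its callers disappear entirely when the feature is compiled out, might look roughly like the following userspace sketch. It is an assumption about what such a layout could be, not code from the series: `struct alt_file`, `struct alt_file_ops`, `alt_checkpoint_file()` and the `ALT_CHECKPOINT` switch are hypothetical names standing in for the kernel structures and `CONFIG_CHECKPOINT`.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_CHECKPOINT. */
#ifndef ALT_CHECKPOINT
#define ALT_CHECKPOINT 1
#endif

struct alt_file;

struct alt_file_ops {
	long (*llseek)(struct alt_file *file, long offset);
#if ALT_CHECKPOINT
	/* The member only exists when the feature is built in. */
	int (*checkpoint)(struct alt_file *file);
#endif
};

struct alt_file {
	const struct alt_file_ops *f_op;
	long pos;
	int saved;
};

static long alt_llseek(struct alt_file *file, long offset)
{
	file->pos = offset;
	return file->pos;
}

#if ALT_CHECKPOINT
static int alt_generic_checkpoint(struct alt_file *file)
{
	file->saved = 1;
	return 0;
}
#endif

/* Every "filesystem" now has to guard its own initializer. */
static const struct alt_file_ops alt_fs_ops = {
	.llseek = alt_llseek,
#if ALT_CHECKPOINT
	.checkpoint = alt_generic_checkpoint,
#endif
};

#if ALT_CHECKPOINT
/* The caller is compiled only together with the op it uses. */
static int alt_checkpoint_file(struct alt_file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(file);
}
#endif
```

The trade-off is visible in `alt_fs_ops`: the structure carries no dead member when the feature is off, at the cost of one preprocessor guard per ops table, which is the per-filesystem #ifdef the quoted text wanted to avoid.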
* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 23:19 ` Andreas Dilger
[not found] ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-19 0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
` (14 subsequent siblings)
17 siblings, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_file_checkpoint() and generic_restore_file(),
which are suitable for normal files and directories. They do not yet
support unlinked files or directories.

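The objref idea in the description above can be illustrated with a small userspace sketch. It is a simplification under stated assumptions: `struct objdemo_hash`, `objdemo_lookup_add()` and `objdemo_checkpoint_obj()` are hypothetical names, a linear scan over a fixed array replaces the real object hash, and reference counting on the tracked object is left out.

```c
#include <stddef.h>

#define OBJDEMO_MAX 64	/* arbitrary capacity for the sketch */

/* Toy replacement for the object hash: remembers every pointer seen. */
struct objdemo_hash {
	const void *ptr[OBJDEMO_MAX];
	int nr;		/* number of distinct objects registered */
	int dumped;	/* how many times object state was written */
};

/*
 * Return the objref (> 0) for @ptr, registering it if needed.
 * *first is set to 1 when the object is seen for the first time.
 * Returns -1 if the table is full.
 */
static int objdemo_lookup_add(struct objdemo_hash *h, const void *ptr,
			      int *first)
{
	int i;

	for (i = 0; i < h->nr; i++) {
		if (h->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->nr == OBJDEMO_MAX)
		return -1;
	h->ptr[h->nr++] = ptr;
	*first = 1;
	return h->nr;
}

/*
 * Dump an object at most once: the first caller triggers the dump,
 * later callers only get the existing objref back.
 */
static int objdemo_checkpoint_obj(struct objdemo_hash *h, const void *ptr)
{
	int first;
	int objref = objdemo_lookup_add(h, ptr, &first);

	if (objref > 0 && first)
		h->dumped++;	/* real code would write the state here */
	return objref;
}
```

Two descriptors that share one open file get the same objref, and the file state is written only once; each descriptor record then only needs to store the objref, the fd number and the close-on-exec flag.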
Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename: ckpt_write_files() => checkpoint_fd_table()
  - Rename: ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if it hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to work around gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching
    ckpt_hbuf_put() (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
+		h->f_credref);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			       file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -EEXIST;
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct 
file_lock *, struct file_lock *); extern int posix_lock_file_wait(struct file *, struct file_lock *); extern int posix_unblock_lock(struct file *, struct file_lock *); @@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner) return; } +static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner) +{ + return -ENOENT; +} + static inline void locks_remove_flock(struct file *filp) { return; @@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); +#ifdef CONFIG_CHECKPOINT +extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file); +#else #define generic_file_checkpoint NULL +#endif extern int vfs_readdir(struct file *, filldir_t, void *); -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
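The image records added above carry an explicit `_padding` member and an `aligned(8)` attribute so that the 64-bit fields sit on 8-byte boundaries regardless of architecture. A minimal userspace sketch of that layout check follows. It assumes a two-word `struct ckpt_hdr` (the real definition lives elsewhere in the series and is not shown here) and swaps the kernel `__u32`/`__u64` types for their `<stdint.h>` equivalents; `ckpt_hdr_file_size()` is a hypothetical helper added only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: stand-in for the real struct ckpt_hdr, which is not part of this hunk. */
struct ckpt_hdr {
	uint32_t type;
	uint32_t len;
};

/* Mirrors struct ckpt_hdr_file from the patch, using userspace fixed-width types. */
struct ckpt_hdr_file {
	struct ckpt_hdr h;
	uint32_t f_type;
	uint32_t f_mode;
	uint32_t f_flags;
	uint32_t _padding;	/* keeps f_pos on an 8-byte boundary */
	uint64_t f_pos;
	uint64_t f_version;
} __attribute__((aligned(8)));

/* Compile-time checks: 64-bit members aligned, record size a multiple of 8. */
_Static_assert(offsetof(struct ckpt_hdr_file, f_pos) % 8 == 0,
	       "f_pos must be 8-byte aligned");
_Static_assert(sizeof(struct ckpt_hdr_file) % 8 == 0,
	       "record size must be a multiple of 8");

/* Hypothetical helper: the size a writer would emit for one file record. */
static size_t ckpt_hdr_file_size(void)
{
	return sizeof(struct ckpt_hdr_file);
}
```

With the assumed 8-byte header the record is 40 bytes on both 32-bit and 64-bit builds, which is the property the explicit padding is there to guarantee.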
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-19 0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
@ 2010-03-19 23:19 ` Andreas Dilger
2010-03-20 4:43 ` Matt Helsley
[not found] ` <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>
[not found] ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
1 sibling, 2 replies; 88+ messages in thread
From: Andreas Dilger @ 2010-03-19 23:19 UTC (permalink / raw)
To: Oren Laadan; +Cc: linux-fsdevel, containers, Matt Helsley

On 2010-03-18, at 18:59, Oren Laadan wrote:
> +int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,
> struct path *root)
> +{
> +	fname = ckpt_fill_fname(path, root, buf, &flen);
> +	if (!IS_ERR(fname)) {
> +		ret = ckpt_write_obj_type(ctx, fname, flen,
> +					  CKPT_HDR_FILE_NAME);

What is the intended use case for the checkpoint/restore being
developed here? It seems like a major risk to do the checkpoint
using the filename, since this is not guaranteed to stay constant
and the restore may give you a different state than what was running
when the checkpoint was done. Storing a file handle in the
checkpoint, instead of (or in addition to) the filename would allow
restoring the state correctly.

Note that you would also need to store some kind of FSID as part of
the file handle, which is a functionality that would be desirable
for Aneesh's recent open_by_handle() patches as well, so getting
this right once would be of use to both projects.

That said, if the intent is to allow the restore to be done on
another node with a "similar" filesystem (e.g. created by rsync/node
image), instead of having a coherent distributed filesystem on all
of the nodes then the filename makes sense.

I would recommend to store both the file handle+FSID and the
filename, preferring the former for "100% correct" restores on the
same node, and the latter for being able to restore on a similar
node (e.g. system files and such that are expected to be the same on
all nodes, but do not necessarily have the same inode number).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 88+ messages in thread
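The recommendation above (store both the handle and the filename, prefer the handle, fall back to the name) can be sketched as a small lookup policy. This is only an illustration under stated assumptions: `struct saved_file_ref`, `restore_open()` and the `stub_*` helpers are hypothetical names that appear nowhere in the patch set, and the two callbacks stand in for whatever open-by-handle and open-by-path mechanisms a real restore would use.

```c
#include <stddef.h>

/* HYPOTHETICAL record: this layout is not taken from the patch set. */
struct saved_file_ref {
	const void *handle;	/* opaque file handle + FSID, or NULL if none was saved */
	size_t handle_len;
	const char *path;	/* filename, as checkpoint_fname() would record it */
};

typedef int (*open_by_handle_fn)(const void *handle, size_t len);
typedef int (*open_by_path_fn)(const char *path);

/*
 * Try the handle first (exact match, same node). If it is missing or the
 * lookup fails, fall back to the filename (similar node).
 * Returns a descriptor, or a negative value if both attempts fail.
 */
static int restore_open(const struct saved_file_ref *ref,
			open_by_handle_fn by_handle, open_by_path_fn by_path)
{
	int fd = -1;

	if (ref->handle && by_handle)
		fd = by_handle(ref->handle, ref->handle_len);
	if (fd < 0 && ref->path && by_path)
		fd = by_path(ref->path);
	return fd;
}

/* Illustrative stubs standing in for real lookups; the fd values are arbitrary. */
static int stub_handle_ok(const void *h, size_t len) { (void)h; (void)len; return 3; }
static int stub_handle_stale(const void *h, size_t len) { (void)h; (void)len; return -1; }
static int stub_path_ok(const char *p) { (void)p; return 4; }
```

Passing `stub_handle_stale` models a restore on a different node, where the saved handle no longer resolves and the filename is the only usable reference.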
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-19 23:19 ` Andreas Dilger
@ 2010-03-20 4:43 ` Matt Helsley
[not found] ` <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-21 17:27 ` Jamie Lokier
[not found] ` <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>
1 sibling, 2 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-20 4:43 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Oren Laadan, linux-fsdevel, containers, Matt Helsley

On Fri, Mar 19, 2010 at 05:19:22PM -0600, Andreas Dilger wrote:
> On 2010-03-18, at 18:59, Oren Laadan wrote:
> >+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,
> >struct path *root)
> >+{
> >+	fname = ckpt_fill_fname(path, root, buf, &flen);
> >+	if (!IS_ERR(fname)) {
> >+		ret = ckpt_write_obj_type(ctx, fname, flen,
> >+					  CKPT_HDR_FILE_NAME);
>
> What is the intended use case for the checkpoint/restore being
> developed here? It seems like a major risk to do the checkpoint

Yes, as you anticipated below, we want to be able to migrate the
image to a similar node.

> using the filename, since this is not guaranteed to stay constant
> and the restore may give you a different state than what was running
> when the checkpoint was done. Storing a file handle in the

We're aware of this. Our assumption is userspace will freeze the
filesystem and/or take suitable snapshots (e.g. with btrfs) while the
tasks being checkpointed are also frozen. If userspace wants to freeze
everything but the task performing the checkpoint then that's fine too.

We decided to have userspace checkpoint the filesystem contents because
it will likely take an extraordinarily long time. We anticipate that
userspace will want to take advantage of many time-saving strategies
which would be impossible to anticipate perfectly for our kernel
syscall ABI.

Even though a wide set of time-saving strategies is available, the goal
is to keep the checkpoint image format and content independent of the
tools that perform migration.

> checkpoint, instead of (or in addition to) the filename would allow
> restoring the state correctly.
>
> Note that you would also need to store some kind of FSID as part of
> the file handle, which is a functionality that would be desirable
> for Aneesh's recent open_by_handle() patches as well, so getting
> this right once would be of use to both projects.

I haven't looked at those, sorry. It may be useful but I think there's
room for adding that in the future as you hinted above.

My guess is, depending on the environment of the restarting machine,
an FSID might not even be enough. Again -- I need to find some time to
review those patches before I can be sure :).

Userspace coordinates the management of the nodes and thus knows best
how to map things like major:minor, /dev/foo, and/or uuids to the
appropriate "things" when it comes time to restart. The best the kernel
can do is provide all of those so that userspace can make the choices
it needs to. However, most of that information is already available
via /proc in mountinfo or via other userspace tools. So we don't save
it in the image nor do we provide new interfaces to get it.

> That said, if the intent is to allow the restore to be done on
> another node with a "similar" filesystem (e.g. created by rsync/node
> image), instead of having a coherent distributed filesystem on all
> of the nodes then the filename makes sense.

Yes, this is the intent.

> I would recommend to store both the file handle+FSID and the
> filename, preferring the former for "100% correct" restores on the
> same node, and the latter for being able to restore on a similar
> node (e.g. system files and such that are expected to be the same on
> all nodes, but do not necessarily have the same inode number).

This sounds like a good idea for the future. However I do not think
inclusion of our patches should be predicated on this since the patches
are still useful for local restart (thanks to things like mount
namespaces) and migration without file handles.

Thanks for having a look at these!

Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 88+ messages in thread
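As a rough illustration of the point that major:minor and mount-point information is already exposed through /proc, here is a minimal sketch that parses the leading fields of one `/proc/<pid>/mountinfo` line. `struct mount_id` and `parse_mountinfo_line()` are hypothetical names, and the sketch assumes mount points shorter than 256 bytes; it also leaves octal escapes such as `\040` (space) undecoded.

```c
#include <stdio.h>

/* HYPOTHETICAL container for the fields a restart tool might care about. */
struct mount_id {
	unsigned int dev_major;
	unsigned int dev_minor;
	char mount_point[256];
};

/*
 * Parse the leading fields of a mountinfo line:
 *   ID PARENT-ID MAJOR:MINOR ROOT MOUNT-POINT ...
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_mountinfo_line(const char *line, struct mount_id *out)
{
	int id, parent;
	char root[256];

	if (sscanf(line, "%d %d %u:%u %255s %255s",
		   &id, &parent, &out->dev_major, &out->dev_minor,
		   root, out->mount_point) != 6)
		return -1;
	return 0;
}
```

A management tool could read every line of the checkpointed task's mountinfo this way before the checkpoint and map the device numbers to whatever the target node uses.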
[parent not found: <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-21 17:27 ` Jamie Lokier
0 siblings, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-21 17:27 UTC (permalink / raw)
To: Matt Helsley
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Matt Helsley wrote:
> > That said, if the intent is to allow the restore to be done on
> > another node with a "similar" filesystem (e.g. created by rsync/node
> > image), instead of having a coherent distributed filesystem on all
> > of the nodes then the filename makes sense.
>
> Yes, this is the intent.

I would worry about programs which are using files which have been
deleted, renamed, or (very common) renamed-over by another process
after being opened, as there's a good chance they will successfully
open the wrong file after c/r, and corrupt state from then on.

This can be avoided by ensuring every checkpointed application is
specially "c/r aware", but that makes the feature a lot less
attractive, as well as uncomfortably unsafe to use on arbitrary
processes. Ideally, c/r would fail on some types of process
(e.g. using sockets), but at least fail in a safe way that does not
lead to quiet data corruption.

-- Jamie
^ permalink raw reply [flat|nested] 88+ messages in thread
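The renamed-over case described above is easy to reproduce from userspace. The following is a minimal sketch, not code from the patch set: `make_file()` and `reopen_hits_wrong_file()` are hypothetical helpers, and the sketch assumes a writable directory on a POSIX filesystem. It opens a file, lets a second file be renamed over the same name, and then compares the inode behind the held descriptor with the inode the name now resolves to.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a one-byte file at path; returns 0 on success. */
static int make_file(const char *path, char c)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, &c, 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/*
 * Returns 1 if reopening by name after a rename-over would reach a
 * different inode than the descriptor opened earlier, 0 if it would
 * reach the same one, -1 on error.
 */
static int reopen_hits_wrong_file(const char *dir)
{
	char a[512], b[512];
	struct stat held, reopened;
	int fd, ret = -1;

	snprintf(a, sizeof(a), "%s/cr-demo-a.%ld", dir, (long)getpid());
	snprintf(b, sizeof(b), "%s/cr-demo-b.%ld", dir, (long)getpid());
	if (make_file(a, 'A') || make_file(b, 'B'))
		goto out;

	fd = open(a, O_RDONLY);		/* the descriptor a checkpoint would record by name */
	if (fd < 0)
		goto out;
	if (rename(b, a) == 0 &&	/* another process renames over the name */
	    fstat(fd, &held) == 0 && stat(a, &reopened) == 0)
		ret = (held.st_ino != reopened.st_ino);
	close(fd);
out:
	unlink(a);
	unlink(b);
	return ret;
}
```

A restart that reopens the saved path would get the second file, which is why the rest of the thread turns to snapshots and to refusing unlinked files.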
[parent not found: <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2010-03-21 19:40 ` Serge E. Hallyn
2010-03-22 1:06 ` Matt Helsley
1 sibling, 0 replies; 88+ messages in thread
From: Serge E. Hallyn @ 2010-03-21 19:40 UTC (permalink / raw)
To: Jamie Lokier
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
> Matt Helsley wrote:
> > > That said, if the intent is to allow the restore to be done on
> > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > image), instead of having a coherent distributed filesystem on all
> > > of the nodes then the filename makes sense.
> >
> > Yes, this is the intent.
>
> I would worry about programs which are using files which have been
> deleted, renamed, or (very common) renamed-over by another process
> after being opened, as there's a good chance they will successfully
> open the wrong file after c/r, and corrupt state from then on.

Userspace is expected to back up and restore the filesystem, for
instance using a btrfs snapshot or a simple rsync or tar.

If we detect anything which really is not supported (for instance
inotify for now) then we fail and leave a log message explaining the
failure.

thanks,
-serge
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2010-03-21 19:40 ` Serge E. Hallyn
@ 2010-03-22 1:06 ` Matt Helsley
1 sibling, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 1:06 UTC (permalink / raw)
To: Jamie Lokier
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> Matt Helsley wrote:
> > > That said, if the intent is to allow the restore to be done on
> > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > image), instead of having a coherent distributed filesystem on all
> > > of the nodes then the filename makes sense.
> >
> > Yes, this is the intent.
>
> I would worry about programs which are using files which have been
> deleted, renamed, or (very common) renamed-over by another process
> after being opened, as there's a good chance they will successfully
> open the wrong file after c/r, and corrupt state from then on.

The code in the patches does check for unlinked files and refuses
to checkpoint if an unlinked file is open. Yes, this limits the
usefulness of the code somewhat but it's a problem we can solve and
c/r is still quite useful without the solution.

My favorite solution for unlinked files is keeping the contents of the
file in the checkpoint image. Another solution is relinking it to a new
"safe" location in the filesystem. Determining the "safe" location is
not very clean because we need one "safe" location per filesystem being
backed-up. Hence I tend to favor the first approach. Neither solution
is implemented and thoroughly tested yet though.

These solutions are needed because the data is not available via a
normal filesystem backup.

Renames are dealt with by requiring userspace to freeze and/or safely
take a snapshot of the filesystem as with any backup.

> This can be avoided by ensuring every checkpointed application is
> specially "c/r aware", but that makes the feature a lot less
> attractive, as well as uncomfortably unsafe to use on arbitrary

We avoided using that solution for the very flaws you point out.
In fact, so far we've managed to avoid requiring cooperation with
the tasks being checkpointed.

> processes. Ideally, c/r would fail on some types of process
> (e.g. using sockets), but at least fail in a safe way that does not
> lead to quiet data corruption.

We've done our best to try and reach that ideal. You're welcome to
have a look at the code to see if you can find any ways in which we
haven't. Here's the code that refuses to checkpoint unsupported files.
I think it's pretty easy to read:

	int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
	{
		struct file *file = (struct file *) ptr;
		int ret;

		if (!file->f_op || !file->f_op->checkpoint) {
			ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
				 file, file->f_op);
			return -EBADF;
		}

		if (is_dnotify_attached(file)) {
			ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file);
			return -EBADF;
		}

		ret = file->f_op->checkpoint(ctx, file);
		if (ret < 0)
			ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
		return ret;
	}

(As Serge noted, we don't support inotify. inotify and fanotify require
an fd to register the fsnotify marks and the struct file associated
with that fd lacks the f_op->checkpoint operation, hence that will
cause checkpoint to fail too and, again, there will be no silent
corruption.)

Negative return values cause sys_checkpoint() to stop checkpointing
and return the given errno.

The f_op->checkpoint is often a generic operation which ensures that
the file is not unlinked before it saves things like the position of
the file (checkpoint_file_common()) and the path to the file
(checkpoint_fname()):

	int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
	{
		struct ckpt_hdr_file_generic *h;
		int ret;

		/*
		 * FIXME: when we'll add support for unlinked files/dirs, we'll
		 * need to distinguish between unlinked files and unlinked dirs.
		 */
		if (d_unlinked(file->f_dentry)) {
			ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
				 file);
			return -EBADF;
		}

		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
		if (!h)
			return -ENOMEM;

		h->common.f_type = CKPT_FILE_GENERIC;

		ret = checkpoint_file_common(ctx, file, &h->common);
		if (ret < 0)
			goto out;
		ret = ckpt_write_obj(ctx, &h->common.h);
		if (ret < 0)
			goto out;
		ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
	 out:
		ckpt_hdr_put(ctx, h);
		return ret;
	}
	EXPORT_SYMBOL(generic_file_checkpoint);

I wrote a simple script to look for missing operations in things like
file_operations. It can output counts in directories/files or show the
spot in the files where the struct is defined and a little context. I
used that script to check which files and protocols aren't supported
(for 2.6.33-rc8), I placed a histogram of the output in the wiki, and
I've tried to keep it up-to-date.

https://ckpt.wiki.kernel.org/index.php/UncheckpointableFilesystems
https://ckpt.wiki.kernel.org/index.php/UncheckpointableProtocols

The script is also there for anyone who wants to use it on newer kernels.

Here's the output which is of interest to folks on linux-fsdevel for
anyone who doesn't wish to follow a link -- the number of
file_operations structures missing the .checkpoint operation:

	162 arch
	  3 block
	  1 crypto
	  1 Documentation
	718 drivers
	178 fs
		  3 9p
		  8 afs
		  1 autofs
		  3 autofs4
		  1 bad_inode.c
		  3 binfmt_misc.c
		  1 block_dev.c
		  2 cachefiles
		  1 char_dev.c
		 15 cifs
		  4 coda
		  2 configfs
		  3 debugfs
		  8 dlm
		  1 ext4
		  1 fifo.c
		  1 filesystems.c
		  3 fscache
		  9 fuse
		  5 gfs2
		  1 hugetlbfs
		  1 jbd2
		  6 jfs
		  1 libfs.c
		  1 locks.c
		  2 ncpfs
		  2 nfs
		  5 nfsd
		  1 no-block.c
		  1 notify
		  1 ntfs
		 15 ocfs2
		 55 proc
		  1 reiserfs
		  1 signalfd.c
		  2 smbfs
		  3 sysfs
		  1 timerfd.c
		  3 xfs
	  1 include
	  4 ipc
	 88 kernel
	  3 lib
	 12 mm
	164 net
	  1 samples
	 35 security
	 29 sound
	  4 virt

Notes:
1. The missing checkpoint file operation in fs/fifo.c is only an
   artifact of the unusual way fifo file ops are assigned. FIFOs are
   supported.
2. The ext4 missing file operation is for the multiblock groups file
   in /proc. IMHO trying to checkpoint the contents of /proc files is
   usually a bad idea. Thankfully, most programs don't hold these
   files open for very long.

Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 88+ messages in thread
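For readers who want to see from userspace which descriptors the `d_unlinked()` check above would likely reject, a rough analogue is to test the link count behind an open descriptor. This is a sketch under stated assumptions, not the kernel check itself: `fd_is_unlinked()` and `demo_unlinked()` are hypothetical helpers, and a link count of zero is only an approximation of `d_unlinked()` (a file with several hard links keeps a non-zero count even after the name used to open it is removed).

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if fd refers to a file with no remaining links, 0 if not, -1 on error. */
static int fd_is_unlinked(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return st.st_nlink == 0;
}

/*
 * Demo: create a temporary file in dir, then unlink it while it is still
 * open, recording the result of the check before and after the unlink.
 * Returns 0 on success, -1 if the temporary file could not be created.
 */
static int demo_unlinked(const char *dir, int *before, int *after)
{
	char path[512];
	int fd;

	snprintf(path, sizeof(path), "%s/cr-unlinked.XXXXXX", dir);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	*before = fd_is_unlinked(fd);
	unlink(path);
	*after = fd_is_unlinked(fd);
	close(fd);
	return 0;
}
```

A pre-checkpoint tool could run such a check over every entry in a task's descriptor table and warn before `sys_checkpoint()` returns -EBADF.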
[parent not found: <20100321194019.GA11714-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100321194019.GA11714-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2010-03-21 20:58 ` Daniel Lezcano
0 siblings, 0 replies; 88+ messages in thread
From: Daniel Lezcano @ 2010-03-21 20:58 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jamie Lokier, Andreas Dilger

Serge E. Hallyn wrote:
> Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
>
>> Matt Helsley wrote:
>>
>>>> That said, if the intent is to allow the restore to be done on
>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>> image), instead of having a coherent distributed filesystem on all
>>>> of the nodes then the filename makes sense.
>>>>
>>> Yes, this is the intent.
>>>
>> I would worry about programs which are using files which have been
>> deleted, renamed, or (very common) renamed-over by another process
>> after being opened, as there's a good chance they will successfully
>> open the wrong file after c/r, and corrupt state from then on.
>>
>
> Userspace is expected to back up and restore the filesystem, for
> instance using a btrfs snapshot or a simple rsync or tar.
>
>
That does not solve the problem Jamie is talking about.
An rsync or a tar will not see a deleted file, and using btrfs just so
that c/r works with deleted files is a bit overkill, no?

I have another question about deleted files. How is the case handled
when a process has a deleted mapped file but no associated file
descriptor?

> If we detect anything which really is not supported (for instance
> inotify for now) then we fail and leave a log message explaining the
> failure.
>
^ permalink raw reply [flat|nested] 88+ messages in thread
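The situation in the second question (a mapping that outlives both its descriptor and its name) can be reproduced with a few lines of userspace code. This is only an illustrative sketch; `mapping_survives_unlink()` is a hypothetical helper and the sketch assumes a writable directory on a filesystem that supports `mmap()`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map a temporary file, then close its descriptor and unlink its name.
 * Returns 1 if the mapping still shows the original contents, 0 if it
 * does not, -1 on error.
 */
static int mapping_survives_unlink(const char *dir)
{
	char path[512];
	const char msg[] = "still here";
	char *map;
	int fd, ok;

	snprintf(path, sizeof(path), "%s/cr-mapped.XXXXXX", dir);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
		close(fd);
		unlink(path);
		return -1;
	}
	map = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* no file descriptor refers to the file any more */
	unlink(path);	/* and no name does either */
	if (map == MAP_FAILED)
		return -1;
	ok = (memcmp(map, msg, sizeof(msg)) == 0);
	munmap(map, sizeof(msg));
	return ok;
}
```

After the `close()` and `unlink()` the only remaining reference to the file is the one held by the mapping itself, so a checkpoint that walked only the descriptor table would miss it.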
[parent not found: <4BA68884.3080003-GANU6spQydw@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <4BA68884.3080003-GANU6spQydw@public.gmane.org>
@ 2010-03-21 21:36 ` Oren Laadan
2010-03-22 2:12 ` Matt Helsley
1 sibling, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-21 21:36 UTC (permalink / raw)
To: Daniel Lezcano
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jamie Lokier, Andreas Dilger

Daniel Lezcano wrote:
> Serge E. Hallyn wrote:
>> Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
>>
>>> Matt Helsley wrote:
>>>
>>>>> That said, if the intent is to allow the restore to be done on
>>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>>> image), instead of having a coherent distributed filesystem on all
>>>>> of the nodes then the filename makes sense.
>>>>>
>>>> Yes, this is the intent.
>>>>
>>> I would worry about programs which are using files which have been
>>> deleted, renamed, or (very common) renamed-over by another process
>>> after being opened, as there's a good chance they will successfully
>>> open the wrong file after c/r, and corrupt state from then on.
>>>
>> Userspace is expected to back up and restore the filesystem, for
>> instance using a btrfs snapshot or a simple rsync or tar.
>>
>>
> That does not solve the problem Jamie is talking about.
> A rsync or a tar will not see a deleted file and using a btrfs to have
> the CR to work with the deleted files is a bit overkill, no ?

Let's separate the issues of file system snapshot and deleted files.

1) File system snapshot:
------------------------
The requirement is to preserve the file system state between the time
of the checkpoint and the time of the restart, because userspace will
expect it to remain the same.

The alternatives are:

a) Use a capable file system, like btrfs, or (modified) nilfs.

b) Userspace saves the state, e.g. w/ tar or rsync (maybe incremental)

c) Assume/expect that the file system isn't modified between checkpoint
and restart (e.g. if we use c/r to suspend a user's session)

d) Expect userspace to adapt to changes if they occur, e.g. by having
the application be aware of the possibility, or by providing a wrapper
that will do some magic prior to restart (by looking at the checkpoint
image).

Options a,b,c are all transparent to the application, while option d
requires that applications become aware of c/r. That's ok, but our
primary goal is to be generic enough to support unmodified
applications.

2) Deleted files:
-----------------
The requirement is that at restart we'll be able to restore the file
pointer in the kernel to a deleted file with the same properties and
contents as it had at the time of the checkpoint.

The alternatives we considered are:

e) For each deleted file, save the contents of that file as part of
the checkpoint image; At restart - create a new file, populate with
the contents, open it (to get an active file pointer), and finally
unlink it, so it is - again - deleted.

f) At checkpoint time, create a file (from scratch) in a dedicated
area of the file system (userspace configurable?), and copy the
contents of the deleted file to this file. Only save the file system
state after this is done. At restart, open the alternative file
instead, and then immediately delete it.

g) At checkpoint time, re-link the file to a dedicated area of the
file system. This requires support from the underlying file system, of
course. For instance, it's trivial for ext2,3 but IIRC will need help
for ext4. Re-linking is essentially attaching a new filename to an
existing inode that is still referenced but is otherwise not reachable
- and make it reachable again. At restart, open the re-linked file and
then immediately delete it.

> I have another question about the deleted files. How is handled the case
> when a process has a deleted mapped file but without an associated file
> descriptor ?
>

It works the same as with non-deleted files (assuming that we know how
to handle deleted files in general, e.g. options e,f,g above):

To checkpoint a task's mm we loop through the vma's and checkpoint
them. For a vma that corresponds to a mapped file, we first save the
vma->vm_file. In turn, for a file pointer we save the filename,
properties, credentials. A file pointer is saved as an independent
object - and is assigned a unique id - objref. The state of the vma
will indicate this objref.

At restart, we will first see the file pointer object, and will open
the file to create a corresponding file pointer. Later when we restore
the vma, we'll locate the (new) file pointer using the objref and use
it in mmap.

Oren.

>> If we detect anything which really is not supported (for instance
>> inotify for now) then we fail and leave a log message explaining the
>> failure.
>>
>
^ permalink raw reply [flat|nested] 88+ messages in thread
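Option (e) above (save the contents in the image, then at restart create, populate, open and unlink) can be sketched from userspace as follows. This is a minimal illustration under stated assumptions, not the kernel restart path: `restore_deleted_file()` is a hypothetical helper, the saved contents are assumed to fit in one `write()`, and file mode, ownership and flags are ignored.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Recreate a deleted-but-open file from contents saved in a checkpoint
 * image: create and open a fresh file, populate it, unlink it so it is
 * deleted again, and restore the saved file position.
 * Returns the new descriptor, or -1 on error.
 */
static int restore_deleted_file(const char *dir, const void *data,
				size_t len, off_t pos)
{
	char path[512];
	int fd;

	snprintf(path, sizeof(path), "%s/cr-restore.XXXXXX", dir);
	fd = mkstemp(path);				/* create and open */
	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len ||	/* populate */
	    unlink(path) != 0 ||			/* delete it again */
	    lseek(fd, pos, SEEK_SET) != pos) {		/* restore the position */
		unlink(path);
		close(fd);
		return -1;
	}
	return fd;
}
```

The descriptor returned behaves like the original one: reads continue from the saved position, and the file has no name in the filesystem.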
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <4BA68884.3080003-GANU6spQydw@public.gmane.org>
2010-03-21 21:36 ` Oren Laadan
@ 2010-03-22 2:12 ` Matt Helsley
1 sibling, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 2:12 UTC (permalink / raw)
To: Daniel Lezcano
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jamie Lokier, Andreas Dilger

On Sun, Mar 21, 2010 at 09:58:44PM +0100, Daniel Lezcano wrote:
> Serge E. Hallyn wrote:
> > Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
> >
> >> Matt Helsley wrote:
> >>
> >>>> That said, if the intent is to allow the restore to be done on
> >>>> another node with a "similar" filesystem (e.g. created by rsync/node
> >>>> image), instead of having a coherent distributed filesystem on all
> >>>> of the nodes then the filename makes sense.
> >>>>
> >>> Yes, this is the intent.
> >>>
> >> I would worry about programs which are using files which have been
> >> deleted, renamed, or (very common) renamed-over by another process
> >> after being opened, as there's a good chance they will successfully
> >> open the wrong file after c/r, and corrupt state from then on.
> >>
> >
> > Userspace is expected to back up and restore the filesystem, for
> > instance using a btrfs snapshot or a simple rsync or tar.
> >
> >
> That does not solve the problem Jamie is talking about.
> A rsync or a tar will not see a deleted file and using a btrfs to have
> the CR to work with the deleted files is a bit overkill, no ?

These are the same kinds of problems encountered during backup. You
can play fast and loose -- like taking a backup while everything is
running -- or you can play it conservative and freeze things.

I think btrfs snapshots are just one possible solution and it's not
overkill.

For some filesystems it might make sense to use the filesystem freezer to
ensure that no files are deleted while the backup takes place. Combined
with tools like rsync or rdiff-backup these operations could be low
bandwidth and low latency if well-known live-migration techniques are used.

Or use dm snapshots.

I imagine fanotify could also be useful so long as userspace has marked
things correctly prior to checkpoint. My high level understanding of
fanotify was we'd be able to delay (or deny) deletion until checkpoint
is complete. Or if using fanotify is unacceptable, at the very least we
could use inotify to know when a file needed for restart has been deleted.

It might go something like:

	start watching files/dirs needed (fanotify or inotify)
	Delay/deny changes (fanotify ONLY)
	freeze tasks for checkpoint
	freeze filesystem contents:
		take btrfs snapshots OR
		take dm snapshots OR
		use filesystem freezer OR
		backup filesystem contents
	sys_checkpoint
	check for changes to the filesystem contents and report failure
		if they interfere with restart (inotify ONLY)
	thaw filesystem contents
	thaw tasks

So there are lots of possible solutions and they don't all involve
trying to stop the whole VFS or the whole machine. They also don't
require anything more in-kernel than what's already being pushed
(our patchset, Eric Paris' patchset for the optional fanotify idea).

> I have another question about the deleted files. How is handled the case
> when a process has a deleted mapped file but without an associated file
> descriptor ?

The mapped file holds a struct file reference in the VMA. When checkpoint
walks the VMAs the struct file is visited just like for struct files
reached from file descriptors.

Cheers,
	-Matt Helsley

^ permalink raw reply [flat|nested] 88+ messages in thread
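The case Daniel asks about, a deleted mapped file with no file descriptor left, is easy to reproduce from userspace. The following is a minimal sketch, assuming a hypothetical helper map_then_delete() and a made-up scratch path under /tmp. It only shows that the mapping keeps the file alive after both close() and unlink(); it says nothing about how the patches walk the VMAs.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: write 'text' to a scratch file, map it,
 * then drop the descriptor and the name. Returns the mapping, which
 * is still readable, or NULL on failure. The caller should munmap()
 * strlen(text) bytes when done.
 */
char *map_then_delete(const char *text)
{
	char path[] = "/tmp/cr-map-XXXXXX";
	size_t len = strlen(text);
	char *map;
	int fd;

	if (len == 0)
		return NULL;	/* mmap() rejects zero-length mappings */

	fd = mkstemp(path);
	if (fd < 0)
		return NULL;
	if (write(fd, text, len) != (ssize_t)len) {
		close(fd);
		unlink(path);
		return NULL;
	}

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* no file descriptor refers to the file now */
	unlink(path);		/* ... and no name does either */
	return map == MAP_FAILED ? NULL : map;
}
```

After map_then_delete("still here") returns, the process has neither an fd nor a pathname for the file, yet the bytes remain readable through the mapping.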
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-21 21:36 ` Oren Laadan
@ 2010-03-22 8:40 ` Daniel Lezcano
[not found] ` <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
1 sibling, 0 replies; 88+ messages in thread
From: Daniel Lezcano @ 2010-03-22 8:40 UTC (permalink / raw)
To: Oren Laadan
Cc: Serge E. Hallyn, linux-fsdevel, containers, Jamie Lokier, Andreas Dilger

Oren Laadan wrote:
>
>
> Daniel Lezcano wrote:
>> Serge E. Hallyn wrote:
>>> Quoting Jamie Lokier (jamie@shareable.org):
>>>
>>>> Matt Helsley wrote:
>>>>
>>>>>> That said, if the intent is to allow the restore to be done on
>>>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>>>> image), instead of having a coherent distributed filesystem on all
>>>>>> of the nodes then the filename makes sense.
>>>>>>
>>>>> Yes, this is the intent.
>>>>>
>>>> I would worry about programs which are using files which have been
>>>> deleted, renamed, or (very common) renamed-over by another process
>>>> after being opened, as there's a good chance they will successfully
>>>> open the wrong file after c/r, and corrupt state from then on.
>>>>
>>> Userspace is expected to back up and restore the filesystem, for
>>> instance using a btrfs snapshot or a simple rsync or tar.
>>>
>>>
>> That does not solve the problem Jamie is talking about.
>> A rsync or a tar will not see a deleted file and using a btrfs to
>> have the CR to work with the deleted files is a bit overkill, no ?
>
> Let's separate the issues of file system snapshot and deleted files.
>
> 1) File system snapshot:
> ------------------------
> The requirement is to preserve the file system state between the time
> of the checkpoint and the time of the restart, because userspace will
> expect it to remain the same.
>
> The alternatives are:
>
> a) Use capable file system, like brfs, or (modified) nilfs.
>
> b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)
>
> c) Assume/expect that the file system isn't modified between checkpoint
> and restart (e.g. if we use c/r to suspend a user's session)
>
> d) Expect userspace to adapt to changes if they occur, e.g. by having
> the application be aware of the possibility, or by providing a wrapper
> that will do some magic prior to restart (by looking at the checkpoint
> image).
>
> Options a,b,c are all transparent to the application, while option
> d requires that applications become aware of c/r. That's ok, but our
> primary goal is to be generic enough to support unmodified applications.
>
> 2) Deleted files:
> -----------------
> The requirement is that at restart we'll be able to restore the file
> pointer in the kernel to a deleted file with the same properties and
> contents as it had at the time of the checkpoint.
>
> The alternatives we considered are:
>
> e) For each deleted file, save the contents of that file as part of
> the checkpoint image;
> At restart - create a new file, populate with the contents, open it
> (to get an active file pointer), and finally unlink it, so it is -
> again - deleted.
>
> f) At checkpoint time, create a file (from scratch) in a dedicated
> area of the file system (userspace configurable?), and copy the
> contents of the deleted file to this file. Only save the file system
> state after this is done.
> At restart, open the alternative file instead, and then immediately
> delete it.
>
> g) At checkpoint time, re-link the file to a dedicated area of the
> file system. This requires support from the underlying file system,
> of course. For instance, it's trivial for ext2,3 but IIRC will need
> help for ext4. Re-linking is essentially attaching a new filename
> to an existing inode that is still referenced but is otherwise not
> reachable - and making it reachable again.
> At restart, open the re-linked file and then immediately delete it.
>
>> I have another question about the deleted files. How is handled the
>> case when a process has a deleted mapped file but without an
>> associated file descriptor ?
>>
>
> It works the same as with non-deleted files (assuming that we know
> how to handle deleted files in general, e.g. options e,f,g above):
>
> To checkpoint a task's mm we loop through the vma's and checkpoint
> them. For a vma that corresponds to a mapped file, we first save
> the vma->vm_file. In turn, for a file pointer we save the filename,
> properties, credentials. A file pointer is saved as an independent
> object - and is assigned a unique id - objref. The state of the vma
> will indicate this objref.
>
> At restart, we will first see the file pointer object, and will
> open the file to create a corresponding file pointer. Later when
> we restore the vma, we'll locate the (new) file pointer using the
> objref and use it in mmap.
>
> Oren.
>

Thanks Oren for the detailed answer.

^ permalink raw reply [flat|nested] 88+ messages in thread
[parent not found: <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-21 23:31 ` xing lin
2010-03-22 8:40 ` Daniel Lezcano
1 sibling, 0 replies; 88+ messages in thread
From: xing lin @ 2010-03-21 23:31 UTC (permalink / raw)
To: Oren Laadan
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger, Jamie Lokier

Hi, I am Xing, a PhD candidate at the University of Utah. I am also quite
interested in container-based virtualization. That's why I joined this
mailing list. :) I am now working on container migration for OpenVZ in
Emulab. I have just begun hacking on the OpenVZ kernel.

On Sun, Mar 21, 2010 at 3:36 PM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote:
>
> Let's separate the issues of file system snapshot and deleted files.
>
> 1) File system snapshot:
> ------------------------
> The requirement is to preserve the file system state between the time
> of the checkpoint and the time of the restart, because userspace will
> expect it to remain the same.
>
> The alternatives are:
>
> a) Use capable file system, like brfs, or (modified) nilfs.
>

Do you mean btrfs? Both of these file systems support snapshots. Sounds
great.

> b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)
>
> c) Assume/expect that the file system isn't modified between checkpoint
> and restart (e.g. if we use c/r to suspend a user's session)
>

This is what OpenVZ does. OpenVZ assumes the underlying file system stays
consistent between checkpoint and restart. If a file does not exist when
restoring the container, the restore fails (with a message showing which
file could not be found). OpenVZ also does not support NFS: if an NFS
export is mounted in the container's file system, the container cannot be
suspended. Since we want to enable container migration in Emulab, we are
trying to solve these issues. NFS is the big issue, since almost all users
store their files in home directories that are mounted from the NFS
server. We are still discussing how to deal with this.

--
Regards,
Xing
School of Computing, University of Utah
http://www.cs.utah.edu/~xinglin/

^ permalink raw reply [flat|nested] 88+ messages in thread
[parent not found: <20100322021242.GI2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100322021242.GI2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 13:51 ` Jamie Lokier
2010-03-22 23:18 ` Andreas Dilger
1 sibling, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-22 13:51 UTC (permalink / raw)
To: Matt Helsley
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

Matt Helsley wrote:
> These are the same kinds of problems encountered during backup. You
> can play fast and loose -- like taking a backup while everything is
> running -- or you can play it conservative and freeze things.

Not really. The issue isn't files getting deleted during the
checkpoint, it's files deleted or renamed over _prior_ to beginning
checkpoint.

That's a common situation. For example if someone did a software
package update, you can easily have processes which reference deleted
files running for months. Same if a program keeps open a data file
which is edited by a text editor, which renames when saving. Etc, etc.

> I think btrfs snapshots are just one possible solution and it's not
> overkill.

I don't think btrfs snapshots solve the problem anyway, unless you
also have a way to look up a file by inode number or equivalent, or
the other ideas discussed such as making a link to a deleted file.

Note that it isn't _just_ deleted files. The name in question may be
deleted but there may still be other links to the file. Or it could
be opened via different link names, some or all of which have been
deleted or renamed over. In those cases it would be a bug to make a
copy of the deleted file in the checkpoint state, or in the
filesystem, as were mentioned earlier...

> I imagine fanotify could also be useful so long as userspace has marked
> things correctly prior to checkpoint. My high level understanding of
> fanotify was we'd be able to delay (or deny) deletion until checkpoint
> is complete.

Yes, that might be a way to block filesystem changes during
checkpoint, although fanotify's capabilities weren't complete enough
for this, last time I looked. (It didn't give sufficient information
about directory operations.)

-- Jamie

^ permalink raw reply [flat|nested] 88+ messages in thread
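The cases Jamie lists (name deleted, name deleted while other links remain, name renamed over) can be told apart from userspace by comparing what the open descriptor refers to with what the recorded pathname resolves to now. A minimal sketch follows, assuming a hypothetical helper check_name() and a made-up enum; it is not how the patch set detects unlinked files.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical classification of the name recorded for an open file. */
enum name_state {
	NAME_OK,	/* the path still names the inode behind the fd */
	NAME_UNLINKED,	/* the inode has no names left at all */
	NAME_GONE,	/* this name was removed, other links remain */
	NAME_REPLACED,	/* the path now names a different inode */
	NAME_ERROR	/* fstat()/stat() failed for another reason */
};

enum name_state check_name(int fd, const char *path)
{
	struct stat by_fd, by_name;

	if (fstat(fd, &by_fd) < 0)
		return NAME_ERROR;
	if (by_fd.st_nlink == 0)
		return NAME_UNLINKED;

	if (stat(path, &by_name) < 0)
		return errno == ENOENT ? NAME_GONE : NAME_ERROR;

	/* Same device and inode number means same file while fd is open. */
	if (by_fd.st_dev != by_name.st_dev || by_fd.st_ino != by_name.st_ino)
		return NAME_REPLACED;
	return NAME_OK;
}
```

Only NAME_OK means that reopening the saved filename at restart would reach the same file. A file that was renamed over and had no other links shows up as NAME_UNLINKED, because its link count has already dropped to zero.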
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100322021242.GI2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22 13:51 ` Jamie Lokier
@ 2010-03-22 23:18 ` Andreas Dilger
1 sibling, 0 replies; 88+ messages in thread
From: Andreas Dilger @ 2010-03-22 23:18 UTC (permalink / raw)
To: Matt Helsley
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jamie Lokier

On 2010-03-21, at 20:12, Matt Helsley wrote:
> These are the same kinds of problems encountered during backup. You
> can play fast and loose -- like taking a backup while everything is
> running -- or you can play it conservative and freeze things.
>
> I think btrfs snapshots are just one possible solution and it's not
> overkill.
>
> For some filesystems it might make sense to use the filesystem
> freezer to ensure that no files are deleted while the backup takes
> place. Combined with tools like rsync or rdiff-backup these
> operations could be low bandwidth and low latency if well-known
> live-migration techniques are used.
>
> Or use dm snapshots.

If you are using snapshots, then even an open-unlinked file will not
be deleted from the filesystem until it is closed, because the inode
will still be available on disk even without the filename. That would
be a good reason to also store the file handle (e.g. inode+generation
for simple filesystems) in the checkpoint file, so that you can
re-open this file by the file handle after the process is restarted.

Since Aneesh is starting to add an interface for this to the kernel
anyway, I don't think it would be very hard to dump/restore a handful
of extra bytes with each file. Conversely, now is the time for getting
the open-by-handle APIs correct for this code.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-21 17:27 ` Jamie Lokier
[not found] ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2010-03-21 19:40 ` Serge E. Hallyn
@ 2010-03-22 1:06 ` Matt Helsley
[not found] ` <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
` (2 more replies)
2 siblings, 3 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 1:06 UTC (permalink / raw)
To: Jamie Lokier
Cc: Matt Helsley, Andreas Dilger, Oren Laadan, linux-fsdevel, containers

On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> Matt Helsley wrote:
> > > That said, if the intent is to allow the restore to be done on
> > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > image), instead of having a coherent distributed filesystem on all
> > > of the nodes then the filename makes sense.
> >
> > Yes, this is the intent.
>
> I would worry about programs which are using files which have been
> deleted, renamed, or (very common) renamed-over by another process
> after being opened, as there's a good chance they will successfully
> open the wrong file after c/r, and corrupt state from then on.

The code in the patches does check for unlinked files and refuses
to checkpoint if an unlinked file is open. Yes, this limits the usefulness
of the code somewhat but it's a problem we can solve and c/r is still quite
useful without the solution.

My favorite solution for unlinked files is keeping the contents of the
file in the checkpoint image. Another solution is relinking it to a new
"safe" location in the filesystem. Determining the "safe" location is not
very clean because we need one "safe" location per filesystem being
backed-up. Hence I tend to favor the first approach. Neither solution is
implemented and thoroughly tested yet though.

These solutions are needed because the data is not available via a normal
filesystem backup.
Renames are dealt with by requiring userspace to freeze and/or safely take
a snapshot of the filesystem as with any backup.

> This can be avoided by ensuring every checkpointed application is
> specially "c/r aware", but that makes the feature a lot less
> attractive, as well as uncomfortably unsafe to use on arbitrary

We avoided using that solution for the very flaws you point out.
In fact, so far we've managed to avoid requiring cooperation with the
tasks being checkpointed.

> processes. Ideally, c/r would fail on some types of process
> (e.g. using sockets), but at least fail in a safe way that does not
> lead to quiet data corruption.

We've done our best to try and reach that ideal. You're welcome to have a
look at the code to see if you can find any ways in which we haven't.
Here's the code that refuses to checkpoint unsupported files. I think
it's pretty easy to read:

int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
{
	struct file *file = (struct file *) ptr;
	int ret;

	if (!file->f_op || !file->f_op->checkpoint) {
		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
			 file, file->f_op);
		return -EBADF;
	}
	if (is_dnotify_attached(file)) {
		ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file);
		return -EBADF;
	}

	ret = file->f_op->checkpoint(ctx, file);
	if (ret < 0)
		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
	return ret;
}

(As Serge noted, we don't support inotify. inotify and fanotify require
an fd to register the fsnotify marks and the struct file associated with
that fd lacks the f_ops->checkpoint operation, hence that will cause
checkpoint to fail too and, again, there will be no silent corruption)

Negative return values cause sys_checkpoint() to stop checkpointing and
return the given errno.
The f_op->checkpoint is often a generic operation which ensures that the
file is not unlinked before it saves things like the position of the file
(checkpoint_file_common()) and the path to the file (checkpoint_fname()):

int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
	struct ckpt_hdr_file_generic *h;
	int ret;

	/*
	 * FIXME: when we'll add support for unlinked files/dirs, we'll
	 * need to distinguish between unlinked files and unlinked dirs.
	 */
	if (d_unlinked(file->f_dentry)) {
		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
			 file);
		return -EBADF;
	}

	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
	if (!h)
		return -ENOMEM;

	h->common.f_type = CKPT_FILE_GENERIC;

	ret = checkpoint_file_common(ctx, file, &h->common);
	if (ret < 0)
		goto out;
	ret = ckpt_write_obj(ctx, &h->common.h);
	if (ret < 0)
		goto out;
	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
out:
	ckpt_hdr_put(ctx, h);
	return ret;
}
EXPORT_SYMBOL(generic_file_checkpoint);

I wrote a simple script to look for missing operations in things like
file_operations. It can output counts in directories/files or show the
spot in the files where the struct is defined and a little context. I used
that script to check which files and protocols aren't supported (for
2.6.33-rc8), I placed a histogram of the output in the wiki, and I've
tried to keep it up-to-date.

https://ckpt.wiki.kernel.org/index.php/UncheckpointableFilesystems
https://ckpt.wiki.kernel.org/index.php/UncheckpointableProtocols

The script is also there for anyone who wants to use it on newer kernels.
Here's the output which is of interest to folks on linux-fsdevel for anyone
who doesn't wish to follow a link -- the number of file_operations
structures missing the .checkpoint operation:

    162 arch
      3 block
      1 crypto
      1 Documentation
    718 drivers
    178 fs
          3 9p
          8 afs
          1 autofs
          3 autofs4
          1 bad_inode.c
          3 binfmt_misc.c
          1 block_dev.c
          2 cachefiles
          1 char_dev.c
         15 cifs
          4 coda
          2 configfs
          3 debugfs
          8 dlm
          1 ext4
          1 fifo.c
          1 filesystems.c
          3 fscache
          9 fuse
          5 gfs2
          1 hugetlbfs
          1 jbd2
          6 jfs
          1 libfs.c
          1 locks.c
          2 ncpfs
          2 nfs
          5 nfsd
          1 no-block.c
          1 notify
          1 ntfs
         15 ocfs2
         55 proc
          1 reiserfs
          1 signalfd.c
          2 smbfs
          3 sysfs
          1 timerfd.c
          3 xfs
      1 include
      4 ipc
     88 kernel
      3 lib
     12 mm
    164 net
      1 samples
     35 security
     29 sound
      4 virt

Notes:
1. The missing checkpoint file operation in fs/fifo.c is only an artifact
   of the unusual way fifo file ops are assigned. FIFOs are supported.
2. The ext4 missing file operation is for the multiblock groups file in
   /proc. IMHO trying to checkpoint the contents of /proc files is usually
   a bad idea. Thankfully, most programs don't hold these files open for
   very long.

Cheers,
-Matt Helsley

^ permalink raw reply [flat|nested] 88+ messages in thread
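The refusal logic in checkpoint_file() and generic_file_checkpoint() comes down to an operations table whose .checkpoint slot may be empty. Below is a minimal userspace sketch of that pattern, not the kernel code: the demo_* names and the `unlinked` flag (a stand-in for d_unlinked(file->f_dentry)) are assumptions made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct file / struct file_operations. */
struct demo_file;

struct demo_file_ops {
    /* NULL means "this kind of file cannot be checkpointed". */
    int (*checkpoint)(struct demo_file *file);
};

struct demo_file {
    const struct demo_file_ops *ops;
    int unlinked;   /* stand-in for d_unlinked(file->f_dentry) */
};

/* Generic handler: refuse unlinked files, otherwise report success. */
int demo_generic_checkpoint(struct demo_file *file)
{
    if (file->unlinked)
        return -EBADF;
    /* A real handler would now write the position and path to the image. */
    return 0;
}

/* Dispatcher: fail loudly when no handler exists, never skip silently. */
int demo_checkpoint_file(struct demo_file *file)
{
    assert(file != NULL);
    if (!file->ops || !file->ops->checkpoint)
        return -EBADF;
    return file->ops->checkpoint(file);
}
```

The point of the design is that an unsupported file type produces a negative errno that aborts the whole checkpoint, rather than an image that silently omits the file.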
[parent not found: <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 2:20 ` Jamie Lokier
2010-03-22 2:55 ` Serge E. Hallyn
1 sibling, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-22 2:20 UTC (permalink / raw)
To: Matt Helsley
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Matt Helsley wrote:
> On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> > Matt Helsley wrote:
> > > > That said, if the intent is to allow the restore to be done on
> > > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > > image), instead of having a coherent distributed filesystem on all
> > > > of the nodes then the filename makes sense.
> > >
> > > Yes, this is the intent.
> >
> > I would worry about programs which are using files which have been
> > deleted, renamed, or (very common) renamed-over by another process
> > after being opened, as there's a good chance they will successfully
> > open the wrong file after c/r, and corrupt state from then on.
>
> The code in the patches does check for unlinked files and refuses
> to checkpoint if an unlinked file is open. Yes, this limits the usefulness
> of the code somewhat but it's a problem we can solve and c/r is still quite
> useful without the solution.
>
> We've done our best to try and reach that ideal. You're welcome to have a
> look at the code to see if you can find any ways in which we haven't.
> Here's the code that refuses to checkpoint unsupported files. I think
> it's pretty easy to read:

From a very quick read,

> 	if (d_unlinked(file->f_dentry)) {
> 		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
> 			 file);

Hmm.

I wonder if d_unlinked() is always true for a file which is opened,
unlinked or renamed over, but has a hard link to it from elsewhere so
the on-disk file hasn't gone away.

I guess it probably is. That's kinda neat! I'd hoped there would be a
good reason for f_dentry eventually ;-)

What about files opened through /proc/self/fd/N before or after the
original file was unlinked/renamed-over. Where does the dentry point?

-- Jamie

^ permalink raw reply [flat|nested] 88+ messages in thread
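The hard-link case raised here can be observed from userspace. The sketch below is only an approximation of d_unlinked(): it assumes a Linux /proc, and it relies on the " (deleted)" suffix that readlink() reports for /proc/self/fd/N, which a file whose real name ends in that string would also trigger. The demo_* helpers are hypothetical names, not part of the patch set.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if the name this descriptor was opened through is gone
 * (reported by /proc as "<path> (deleted)"), 0 if it still exists,
 * and -1 if the /proc link cannot be read.
 */
int demo_opened_name_deleted(int fd)
{
    static const char suffix[] = " (deleted)";
    const size_t slen = sizeof(suffix) - 1;
    char link[64], target[4096];
    ssize_t n;

    assert(fd >= 0);
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return (size_t)n >= slen && strcmp(target + n - slen, suffix) == 0;
}

/* Returns the inode's hard-link count, or -1 on error. */
long demo_link_count(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return (long)st.st_nlink;
}
```

With a file opened as "a", hard-linked to "b", and then unlinked as "a", the first helper is expected to report 1 while the second still reports a link count of 1: the opened name is gone even though the on-disk file is not.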
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22 2:20 ` Jamie Lokier
@ 2010-03-22 2:55 ` Serge E. Hallyn
1 sibling, 0 replies; 88+ messages in thread
From: Serge E. Hallyn @ 2010-03-22 2:55 UTC (permalink / raw)
To: Matt Helsley
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger, Jamie Lokier, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> > Matt Helsley wrote:
> > > > That said, if the intent is to allow the restore to be done on
> > > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > > image), instead of having a coherent distributed filesystem on all
> > > > of the nodes then the filename makes sense.
> > >
> > > Yes, this is the intent.
> >
> > I would worry about programs which are using files which have been
> > deleted, renamed, or (very common) renamed-over by another process
> > after being opened, as there's a good chance they will successfully
> > open the wrong file after c/r, and corrupt state from then on.
>
> The code in the patches does check for unlinked files and refuses
> to checkpoint if an unlinked file is open. Yes, this limits the usefulness

Oh, haha - open/mapped unlinked files. Sorry :)

-serge

^ permalink raw reply [flat|nested] 88+ messages in thread
[parent not found: <20100322022003.GA16462-yetKDKU6eevNLxjTenLetw@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100322022003.GA16462-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2010-03-22 3:37 ` Matt Helsley
0 siblings, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 3:37 UTC (permalink / raw)
To: Jamie Lokier
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Mar 22, 2010 at 02:20:03AM +0000, Jamie Lokier wrote:
> Matt Helsley wrote:
> > On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> > > Matt Helsley wrote:
> > > > > That said, if the intent is to allow the restore to be done on
> > > > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > > > image), instead of having a coherent distributed filesystem on all
> > > > > of the nodes then the filename makes sense.
> > > >
> > > > Yes, this is the intent.
> > >
> > > I would worry about programs which are using files which have been
> > > deleted, renamed, or (very common) renamed-over by another process
> > > after being opened, as there's a good chance they will successfully
> > > open the wrong file after c/r, and corrupt state from then on.
> >
> > The code in the patches does check for unlinked files and refuses
> > to checkpoint if an unlinked file is open. Yes, this limits the usefulness
> > of the code somewhat but it's a problem we can solve and c/r is still quite
> > useful without the solution.
> >
> > We've done our best to try and reach that ideal. You're welcome to have a
> > look at the code to see if you can find any ways in which we haven't.
> > Here's the code that refuses to checkpoint unsupported files. I think
> > it's pretty easy to read:
>
> From a very quick read,
>
> > 	if (d_unlinked(file->f_dentry)) {
> > 		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
> > 			 file);
>
> Hmm.
>
> I wonder if d_unlinked() is always true for a file which is opened,
> unlinked or renamed over, but has a hard link to it from elsewhere so
> the on-disk file hasn't gone away.

Well, if the on-disk file hasn't gone away due to a hardlink then we
won't need to save the file in the checkpoint image -- the filesystem
content backup done during checkpoint should also get the file contents.

> I guess it probably is. That's kinda neat! I'd hoped there would be a
> good reason for f_dentry eventually ;-)
>
> What about files opened through /proc/self/fd/N before or after the
> original file was unlinked/renamed-over. Where does the dentry point?

Before the unlink it will result in the same file being opened. If it's
opened by a task being checkpointed then we'll be in the same situation
as the "self" task. If it's opened by a task not being checkpointed then
the "leak detection" code will notice that there's an unaccounted reference
to the file and checkpoint will fail.

That code is in checkpoint/objhash.c. It works by doing two passes:

1. Collect references
2. Checkpoint referenced objects

We only do the second pass if the ref count matches the number of times
we've "collected" the file (I added comments to the .ref_foo = ops so you
don't need to see them to get the idea):

static struct ckpt_obj_ops ckpt_obj_ops[] = {
...
	/* file object */
	{
		.obj_name = "FILE",
		.obj_type = CKPT_OBJ_FILE,
		.ref_drop = obj_file_drop,	/* aka fput */
		.ref_grab = obj_file_grab,	/* aka get_file */
		.ref_users = obj_file_users,	/* does atomic read of f_count */
		.checkpoint = checkpoint_file,
		.restore = restore_file,
	},
...
};

...

/**
 * ckpt_obj_contained - test if shared objects are contained in checkpoint
 * @ctx: checkpoint context
 *
 * Loops through all objects in the table and compares the number of
 * references accumulated during checkpoint, with the reference count
 * reported by the kernel.
 *
 * Return 1 if respective counts match for all objects, 0 otherwise.
 */
int ckpt_obj_contained(struct ckpt_ctx *ctx)
{
	struct ckpt_obj *obj;
	struct hlist_node *node;

	/* account for ctx->{file,logfile} (if in the table already) */
	ckpt_obj_users_inc(ctx, ctx->file, 1);
	if (ctx->logfile)
		ckpt_obj_users_inc(ctx, ctx->logfile, 1);
	/* account for ctx->root_nsproxy (if in the table already) */
	ckpt_obj_users_inc(ctx, ctx->root_nsproxy, 1);

	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
		if (!obj->ops->ref_users)
			continue;

		if (obj->ops->obj_type == CKPT_OBJ_SOCK)
			obj_sock_adjust_users(obj);

		if (obj->ops->ref_users(obj->ptr) != obj->users) {
			ckpt_err(ctx, -EBUSY,
				 "%(O)%(P)%(S)Usage leak (%d != %d)\n",
				 obj->objref, obj->ptr, obj->ops->obj_name,
				 obj->ops->ref_users(obj->ptr), obj->users);
			return 0;
		}
	}

	return 1;
}

...

So that hopefully addresses your questions regarding the use of the symlinks
before the unlink.

After the unlink those symlinks are broken since they have "(deleted)"
appended. Making sure they are broken after restart is one detail I've
thought about. To make it perfect I think we could:

1. Move any existing file at the original symlinked path to a
   temporary location.
2. Restore the "unlinked" file to that location.
   (in quotes since it's not unlinked yet)
3. Open the "unlinked" file.
4. Unlink the file again.
5. Move the existing file back from the temporary location.

As with relinking, we need a good way to do the "temporary location".
That is complicated because we need to choose a location that we have
permission to write to, always exists during restart, and is guaranteed
not to have files in it. Relinking the file shifts these problems from
restart to checkpoint.
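The five restart steps listed above can be sketched in userspace C. This is only an illustration under stated assumptions: it reads "that location" in step 2 as the original path, it takes the file contents from the checkpoint image (Matt's preferred approach, not yet implemented in the patches), and demo_restore_unlinked() together with its `tmp_path` parameter is a hypothetical name. It is not atomic, and it assumes nothing already exists at `tmp_path`.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch set: recreate a file that was
 * open-but-unlinked at checkpoint time and return a descriptor to it.
 * Returns the new fd, or a negative errno value on failure.
 */
int demo_restore_unlinked(const char *path, const char *tmp_path,
                          const void *data, size_t len)
{
    int moved = 0, fd = -1, err = 0;
    ssize_t n;

    assert(path != NULL && tmp_path != NULL);

    /* Step 1: move any file that now lives at the original path aside. */
    if (rename(path, tmp_path) == 0)
        moved = 1;
    else if (errno != ENOENT)
        return -errno;

    /* Steps 2 and 3: recreate the file from saved contents, keep it open. */
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        err = errno;
    } else {
        n = write(fd, data, len);
        if (n < 0)
            err = errno;
        else if ((size_t)n != len)
            err = EIO;
        /* Step 4: unlink again; the open descriptor keeps the inode alive. */
        if (unlink(path) < 0 && !err)
            err = errno;
    }

    /* Step 5: put the displaced file back, whether or not we succeeded. */
    if (moved && rename(tmp_path, path) < 0 && !err)
        err = errno;

    if (err) {
        if (fd >= 0)
            close(fd);
        return -err;
    }
    lseek(fd, 0, SEEK_SET);   /* the saved file position would go here */
    return fd;
}
```

The sketch makes the difficulty with the "temporary location" concrete: step 1 needs a writable `tmp_path` on the same filesystem as `path`, and anything already at `tmp_path` would be overwritten by rename().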
In case you're bored, before Oren posted these patches I wrote:

https://ckpt.wiki.kernel.org/index.php/Checklist/UnlinkedFiles

and there's lots of info related to what we do and don't support, many
related to files in one way or another, in the table at:

https://ckpt.wiki.kernel.org/index.php/Checklist

I'll update that page with some of my responses above. Getting your
thoughts on my ideas outlined above would be excellent. If you've got some
counter proposals I'd be happy to hear them too. I'll add a reference to
this thread and an edited collection of my rambling responses to that page
if you like.

Cheers,
-Matt Helsley

^ permalink raw reply [flat|nested] 88+ messages in thread
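The two-pass leak detection described in this message (collect references, then checkpoint only if every count matches) can be illustrated with a small userspace model. This is a sketch under assumptions, not the objhash.c code: `refcount` stands in for the kernel's f_count, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a shared object tracked during checkpoint. */
struct demo_obj {
    int refcount;    /* references held system-wide (stand-in for f_count) */
    int collected;   /* references found while walking checkpointed tasks */
};

/* A checkpointed task and the objects its descriptor table points at. */
struct demo_task {
    struct demo_obj **files;
    size_t nfiles;
};

/* Pass 1: count every reference reachable from the checkpointed tasks. */
void demo_collect(const struct demo_task *tasks, size_t ntasks)
{
    size_t t, f;

    assert(tasks != NULL || ntasks == 0);
    for (t = 0; t < ntasks; t++)
        for (f = 0; f < tasks[t].nfiles; f++)
            tasks[t].files[f]->collected++;
}

/*
 * Gate for pass 2: return 1 only if every object's references are all
 * accounted for, i.e. nothing outside the checkpoint shares the object.
 */
int demo_contained(struct demo_obj *const *objs, size_t nobjs)
{
    size_t i;

    for (i = 0; i < nobjs; i++)
        if (objs[i]->refcount != objs[i]->collected)
            return 0;   /* usage leak: an outside task holds a reference */
    return 1;
}
```

An object with two references that are both found during the collect pass is contained; an object with two references of which only one is found indicates a holder outside the checkpointed set, and the checkpoint is refused.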
[parent not found: <20100322033724.GA20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <20100322033724.GA20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 14:13 ` Jamie Lokier
0 siblings, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-22 14:13 UTC (permalink / raw)
To: Matt Helsley
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Matt Helsley wrote:
> > I wonder if d_unlinked() is always true for a file which is opened,
> > unlinked or renamed over, but has a hard link to it from elsewhere so
> > the on-disk file hasn't gone away.
>
> Well, if the on-disk file hasn't gone away due to a hardlink then we
> won't need to save the file in the checkpoint image -- the filesystem
> content backup done during checkpoint should also get the file contents.

When that happens, how do you open the correct file on restart? You
don't know the other link names unless you scan the entire filesystem.
Is that done?

> > I guess it probably is. That's kinda neat! I'd hoped there would be a
> > good reason for f_dentry eventually ;-)
> >
> > What about files opened through /proc/self/fd/N before or after the
> > original file was unlinked/renamed-over. Where does the dentry point?
>
> Before the unlink it will result in the same file being opened. If it's
> opened by a task being checkpointed then we'll be in the same situation
> as the "self" task. If it's opened by a task not being checkpointed then
> the "leak detection" code will notice that there's an unaccounted reference
> to the file and checkpoint will fail.

In a nutshell, is that: If you have a filp (open file pointer
(i.e. including seek position)) which is shared between a task which
is checkpointed and a task which isn't checkpointed, that is the
unaccounted reference and will fail? E.g. as you might get with
dup+fork or AF_LOCAL descriptor passing?

Assuming yes, that has nothing specific to do with /proc. My question
about /proc was just about whether the newly open file shares the
dentry or gets a new one, I suppose.

Note that...

> So that hopefully addresses your questions regarding the use of the symlinks
> before the unlink.
>
> After the unlink those symlinks are broken since they have "(deleted)"
> appended.

...the "links" in /proc/N/fd/ are *not* real symlinks, and opening
them does not use the text returned by readlink(). The "(deleted)"
text doesn't stop them from being opened after they are unlinked or
renamed over (and it certainly doesn't try to open a file with
" (deleted)" in the name :-).

> As with relinking, we need a good way to do the "temporary location".
> That is complicated because we need to choose a location that we have
> permission to write to, always exists during restart, and is guaranteed
> not to have files in it. Relinking the file shifts these problems from
> restart to checkpoint.

It also breaks programs which expect fstat() to always return the same
st_ino while a file is open. Even FAT guarantees that, I think :-)
Can't win them all :-)

-- Jamie

^ permalink raw reply [flat|nested] 88+ messages in thread
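The behaviour of the /proc/N/fd entries described in this message is easy to check from userspace. The sketch below assumes a Linux system with /proc mounted; demo_reopen_via_proc() and demo_same_inode() are hypothetical helper names. It reopens an already-open descriptor through /proc/self/fd/N and compares st_dev/st_ino via fstat(), which also relates to the point about programs expecting a stable st_ino.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Reopen an already-open descriptor through /proc/self/fd/N.
 * Returns a new descriptor, or -1 with errno set by open().
 */
int demo_reopen_via_proc(int fd, int flags)
{
    char link[64];

    assert(fd >= 0);
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    return open(link, flags);
}

/* Returns 1 if both descriptors refer to the same inode on the same device. */
int demo_same_inode(int a, int b)
{
    struct stat sa, sb;

    if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Calling demo_reopen_via_proc() on a descriptor whose file has already been unlinked is expected to succeed and to yield the same inode. Unlike dup(), the result is a separate open file description with its own offset, so a read on the new descriptor starts at the beginning of the file regardless of where the original descriptor is positioned.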
[parent not found: <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>
@ 2010-03-20 4:43 ` Matt Helsley
0 siblings, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-20 4:43 UTC (permalink / raw)
To: Andreas Dilger
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, Mar 19, 2010 at 05:19:22PM -0600, Andreas Dilger wrote:
> On 2010-03-18, at 18:59, Oren Laadan wrote:
> >+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,
> >struct path *root)
> >+{
> >+	fname = ckpt_fill_fname(path, root, buf, &flen);
> >+	if (!IS_ERR(fname)) {
> >+		ret = ckpt_write_obj_type(ctx, fname, flen,
> >+					  CKPT_HDR_FILE_NAME);
>
> What is the intended use case for the checkpoint/restore being
> developed here? It seems like a major risk to do the checkpoint

Yes, as you anticipated below, we want to be able to migrate the image to
a similar node.

> using the filename, since this is not guaranteed to stay constant
> and the restore may give you a different state than what was running
> when the checkpoint was done. Storing a file handle in the

We're aware of this. Our assumption is userspace will freeze the filesystem
and/or take suitable snapshots (e.g. with btrfs) while the tasks being
checkpointed are also frozen. If userspace wants to freeze everything but
the task performing the checkpoint then that's fine too.

We decided to have userspace checkpoint the filesystem contents because it
will likely take an extraordinarily long time. We anticipate that userspace
will want to take advantage of many time-saving strategies which would be
impossible to anticipate perfectly for our kernel syscall ABI.

Even though a wide set of time-saving strategies is available, the goal is
to keep the checkpoint image format and content independent of the tools
that perform migration.

> checkpoint, instead of (or in addition to) the filename would allow
> restoring the state correctly.
>
> Note that you would also need to store some kind of FSID as part of
> the file handle, which is a functionality that would be desirable
> for Aneesh's recent open_by_handle() patches as well, so getting
> this right once would be of use to both projects.

I haven't looked at those, sorry. It may be useful but I think there's
room for adding that in the future as you hinted above. My guess is,
depending on the environment of the restarting machine, an FSID might
not even be enough. Again -- I need to find some time to review those
patches before I can be sure :).

Userspace coordinates the management of the nodes and thus knows best how
to map things like major:minor, /dev/foo, and/or uuids to the appropriate
"things" when it comes time to restart. The best the kernel can do is
provide all of those so that userspace can make the choices it needs to.
However, most of that information is already available via /proc in
mountinfo or via other userspace tools. So we don't save it in the image
nor do we provide new interfaces to get it.

> That said, if the intent is to allow the restore to be done on
> another node with a "similar" filesystem (e.g. created by rsync/node
> image), instead of having a coherent distributed filesystem on all
> of the nodes then the filename makes sense.

Yes, this is the intent.

> I would recommend to store both the file handle+FSID and the
> filename, preferring the former for "100% correct" restores on the
> same node, and the latter for being able to restore on a similar
> node (e.g. system files and such that are expected to be the same on
> all nodes, but do not necessarily have the same inode number).

This sounds like a good idea for the future. However I do not think
inclusion of our patches should be predicated on this since the patches
are still useful for local restart (thanks to things like mount
namespaces) and migration without file handles.

Thanks for having a look at these!

Cheers,
-Matt Helsley

^ permalink raw reply [flat|nested] 88+ messages in thread
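Matt notes that userspace can already get major:minor and mount point information from mountinfo under /proc. A minimal sketch of pulling the leading fields out of one such line follows; the struct and function names are hypothetical, and the sample line in the test is the one used in the proc(5) manual page, not data from this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the leading fields of a mountinfo line. */
struct mount_entry {
	int mount_id;
	int parent_id;
	unsigned int major;
	unsigned int minor;
	char root[256];
	char mount_point[256];
};

/*
 * Parse "ID PARENT MAJOR:MINOR ROOT MOUNTPOINT ..." from one line.
 * Returns 0 on success, -1 if the line does not match. Spaces inside
 * paths appear escaped as \040 in mountinfo, so %s is sufficient here.
 */
int parse_mountinfo_line(const char *line, struct mount_entry *e)
{
	int n = sscanf(line, "%d %d %u:%u %255s %255s",
		       &e->mount_id, &e->parent_id, &e->major, &e->minor,
		       e->root, e->mount_point);

	return n == 6 ? 0 : -1;
}
```

A restart tool could run this over every line of /proc/self/mountinfo on the target node and build its own mapping from device numbers to mount points; how that mapping is then used is a policy decision left to userspace, as the message says.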
[parent not found: <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-19 23:19 ` Andreas Dilger
2010-03-22 10:30 ` Nick Piggin
1 sibling, 0 replies; 88+ messages in thread
From: Andreas Dilger @ 2010-03-19 23:19 UTC (permalink / raw)
To: Oren Laadan
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 2010-03-18, at 18:59, Oren Laadan wrote:
> +int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,
> struct path *root)
> +{
> +	fname = ckpt_fill_fname(path, root, buf, &flen);
> +	if (!IS_ERR(fname)) {
> +		ret = ckpt_write_obj_type(ctx, fname, flen,
> +					  CKPT_HDR_FILE_NAME);

What is the intended use case for the checkpoint/restore being
developed here? It seems like a major risk to do the checkpoint
using the filename, since this is not guaranteed to stay constant
and the restore may give you a different state than what was running
when the checkpoint was done. Storing a file handle in the
checkpoint, instead of (or in addition to) the filename would allow
restoring the state correctly.

Note that you would also need to store some kind of FSID as part of
the file handle, which is a functionality that would be desirable
for Aneesh's recent open_by_handle() patches as well, so getting
this right once would be of use to both projects.

That said, if the intent is to allow the restore to be done on
another node with a "similar" filesystem (e.g. created by rsync/node
image), instead of having a coherent distributed filesystem on all
of the nodes then the filename makes sense.

I would recommend to store both the file handle+FSID and the
filename, preferring the former for "100% correct" restores on the
same node, and the latter for being able to restore on a similar
node (e.g. system files and such that are expected to be the same on
all nodes, but do not necessarily have the same inode number).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply [flat|nested] 88+ messages in thread
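The risk Andreas describes, a saved filename silently resolving to a different file later, can be illustrated in user space. The sketch below is an assumption-laden stand-in: it records (st_dev, st_ino) instead of a real file handle plus FSID, which is weaker (inode numbers can be reused and are not portable across nodes), and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical identity record saved next to a filename. */
struct file_ident {
	dev_t dev;
	ino_t ino;
};

/* Record which inode 'path' names right now; 0 on success, -1 on error. */
int record_ident(const char *path, struct file_ident *id)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	id->dev = st.st_dev;
	id->ino = st.st_ino;
	return 0;
}

/* 1 if 'path' still names the recorded inode, 0 if not, -1 on error. */
int path_matches_ident(const char *path, const struct file_ident *id)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return st.st_dev == id->dev && st.st_ino == id->ino;
}

/*
 * Demo: rename another file over the recorded path between "checkpoint"
 * and "restart". Returns 1 if the mismatch was detected, 0 if not,
 * -1 if the setup failed.
 */
int demo_renamed_over(void)
{
	char a[] = "/tmp/cr-ident-a-XXXXXX";	/* arbitrary templates */
	char b[] = "/tmp/cr-ident-b-XXXXXX";
	struct file_ident id;
	int fda, fdb, before, after;

	fda = mkstemp(a);
	if (fda < 0)
		return -1;
	fdb = mkstemp(b);
	if (fdb < 0) {
		close(fda);
		unlink(a);
		return -1;
	}
	if (record_ident(a, &id) != 0 ||
	    (before = path_matches_ident(a, &id)) < 0 ||
	    rename(b, a) != 0) {	/* 'a' now names b's inode */
		close(fda);
		close(fdb);
		unlink(a);
		unlink(b);
		return -1;
	}
	after = path_matches_ident(a, &id);

	/* both fds stayed open, so neither inode could have been reused */
	close(fda);
	close(fdb);
	unlink(a);
	return (before == 1 && after == 0) ? 1 : 0;
}
```

Opening the path by name after the rename would succeed, which is exactly the "successfully open the wrong file" case; the identity comparison is what notices the difference.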
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-19 23:19 ` Andreas Dilger
@ 2010-03-22 10:30 ` Nick Piggin
2010-03-22 13:22 ` Matt Helsley
2010-03-22 13:22 ` Matt Helsley
1 sibling, 2 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 10:30 UTC (permalink / raw)
To: Oren Laadan
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

On Thu, Mar 18, 2010 at 08:59:47PM -0400, Oren Laadan wrote:
> @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
>  		return -EINVAL; /* cleanup by ckpt_ctx_free() */
>  	}
>
> +	/* root vfs (FIX: WILL CHANGE with mnt-ns etc */
> +	task_lock(ctx->root_task);
> +	fs = ctx->root_task->fs;
> +	read_lock(&fs->lock);
> +	ctx->root_fs_path = fs->root;
> +	path_get(&ctx->root_fs_path);
> +	read_unlock(&fs->lock);
> +	task_unlock(ctx->root_task);
> +
>  	return 0;
>  }
>
> diff --git a/checkpoint/files.c b/checkpoint/files.c
> new file mode 100644
> index 0000000..7a57b24
> --- /dev/null
> +++ b/checkpoint/files.c
> +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
> +{
> +	struct path tmp = *root;
> +	char *fname;
> +
> +	BUG_ON(!buf);
> +	spin_lock(&dcache_lock);
> +	fname = __d_path(path, &tmp, buf, *len);
> +	spin_unlock(&dcache_lock);
> +	if (IS_ERR(fname))
> +		return fname;
> +	*len = (buf + (*len) - fname);
> +	/*
> +	 * FIX: if __d_path() changed these, it must have stepped out of
> +	 * init's namespace. Since currently we require a unified namespace
> +	 * within the container: simply fail.
> +	 */
> +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> +		fname = ERR_PTR(-EBADF);

Maybe something like this is better in fs/?

> +static int scan_fds(struct files_struct *files, int **fdtable)
> +{
> +	struct fdtable *fdt;
> +	int *fds = NULL;
> +	int i = 0, n = 0;
> +	int tot = CKPT_DEFAULT_FDTABLE;
> +
> +	/*
> +	 * We assume that all tasks possibly sharing the file table are
> +	 * frozen (or we are a single process and we checkpoint ourselves).
> +	 * Therefore, we can safely proceed after krealloc() from where we
> +	 * left off. Otherwise the file table may be modified by another
> +	 * task after we scan it. The behavior in this case is undefined,
> +	 * and either checkpoint or restart will likely fail.
> +	 */
> + retry:
> +	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
> +	if (!fds)
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	for (/**/; i < fdt->max_fds; i++) {
> +		if (!fcheck_files(files, i))
> +			continue;
> +		if (n == tot) {
> +			rcu_read_unlock();
> +			tot *= 2;	/* won't overflow: kmalloc will fail */
> +			goto retry;
> +		}
> +		fds[n++] = i;
> +	}
> +	rcu_read_unlock();

...

> +static int checkpoint_file_desc(struct ckpt_ctx *ctx,
> +				struct files_struct *files, int fd)
> +{
> +	struct ckpt_hdr_file_desc *h;
> +	struct file *file = NULL;
> +	struct fdtable *fdt;
> +	int objref, ret;
> +	int coe = 0;	/* avoid gcc warning */
> +	pid_t pid;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	file = fcheck_files(files, fd);
> +	if (file) {
> +		coe = FD_ISSET(fd, fdt->close_on_exec);
> +		get_file(file);
> +	}
> +	rcu_read_unlock();
> +
> +	ret = find_locks_with_owner(file, files);
> +	/*
> +	 * find_locks_with_owner() returns an error when there
> +	 * are no locks found, so we *want* it to return an error
> +	 * code. Its success means we have to fail the checkpoint.
> +	 */
> +	if (!ret) {
> +		ret = -EBADF;
> +		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
> +		goto out;
> +	}
> +
> +	/* sanity check (although this shouldn't happen) */
> +	ret = -EBADF;
> +	if (!file) {
> +		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
> +		goto out;
> +	}
> +
> +	/*
> +	 * TODO: Implement c/r of fowner and f_sigio. Should be
> +	 * trivial, but for now we just refuse its checkpoint
> +	 */
> +	pid = f_getown(file);
> +	if (pid) {
> +		ret = -EBUSY;
> +		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
> +		goto out;
> +	}
> +
> +	/*
> +	 * if seen first time, this will add 'file' to the objhash, keep
> +	 * a reference to it, dump its state while at it.
> +	 */

All these kinds of things (including above hunks) IMO are nasty to put
outside fs/. It would be nice to see higher level functionality
implemented in fs and exported to your checkpoint stuff.

Apparently it's hard because checkpointing is so incestuous with
everything, but that's why it's important to structure the code well.

^ permalink raw reply [flat|nested] 88+ messages in thread
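The grow-and-resume loop in the quoted scan_fds() can be sketched in plain user-space C. This is an analogue only: a plain pointer array stands in for the fd table, realloc() for krealloc(), and there is no RCU; the function name and the `initial` parameter (playing the role of CKPT_DEFAULT_FDTABLE) are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Collect the indices of all non-NULL slots of table[0..max-1] into a
 * freshly allocated array stored in *out. Returns the number of indices
 * found, or -1 on allocation failure.
 */
int scan_slots(void *const *table, int max, int initial, int **out)
{
	int *idx = NULL, *tmp;
	int i = 0, n = 0;
	int tot = initial > 0 ? initial : 1;

retry:
	tmp = realloc(idx, (size_t)tot * sizeof(*idx));
	if (!tmp) {
		free(idx);	/* realloc() leaves the old block intact on failure */
		return -1;
	}
	idx = tmp;

	for (; i < max; i++) {
		if (!table[i])
			continue;
		if (n == tot) {
			tot *= 2;
			goto retry;	/* i and n are kept: resume, don't restart */
		}
		idx[n++] = i;
	}

	*out = idx;
	return n;
}
```

As in the quoted comment, resuming from `i` is only sound if nothing modifies the table between passes; the kernel code relies on the tasks being frozen for that. The sketch keeps the old pointer in a temporary so it can be freed when the reallocation fails.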
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-22 10:30 ` Nick Piggin
2010-03-22 13:22 ` Matt Helsley
@ 2010-03-22 13:22 ` Matt Helsley
2010-03-22 13:38 ` Nick Piggin
[not found] ` <20100322132232.GD20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
1 sibling, 2 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 13:22 UTC (permalink / raw)
To: Nick Piggin
Cc: Oren Laadan, linux-fsdevel, containers, Matt Helsley, Andreas Dilger

On Mon, Mar 22, 2010 at 09:30:35PM +1100, Nick Piggin wrote:
> On Thu, Mar 18, 2010 at 08:59:47PM -0400, Oren Laadan wrote:
> > @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
> >  		return -EINVAL; /* cleanup by ckpt_ctx_free() */
> >  	}
> >
> > +	/* root vfs (FIX: WILL CHANGE with mnt-ns etc */
> > +	task_lock(ctx->root_task);
> > +	fs = ctx->root_task->fs;
> > +	read_lock(&fs->lock);
> > +	ctx->root_fs_path = fs->root;
> > +	path_get(&ctx->root_fs_path);
> > +	read_unlock(&fs->lock);
> > +	task_unlock(ctx->root_task);
> > +
> >  	return 0;
> >  }
> >
> > diff --git a/checkpoint/files.c b/checkpoint/files.c
> > new file mode 100644
> > index 0000000..7a57b24
> > --- /dev/null
> > +++ b/checkpoint/files.c
> > +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
> > +{
> > +	struct path tmp = *root;
> > +	char *fname;
> > +
> > +	BUG_ON(!buf);
> > +	spin_lock(&dcache_lock);
> > +	fname = __d_path(path, &tmp, buf, *len);
> > +	spin_unlock(&dcache_lock);
> > +	if (IS_ERR(fname))
> > +		return fname;
> > +	*len = (buf + (*len) - fname);
> > +	/*
> > +	 * FIX: if __d_path() changed these, it must have stepped out of
> > +	 * init's namespace. Since currently we require a unified namespace
> > +	 * within the container: simply fail.
> > +	 */
> > +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> > +		fname = ERR_PTR(-EBADF);
>
> Maybe something like this is better in fs/?
>
> > +static int scan_fds(struct files_struct *files, int **fdtable)
> > +{
> > +	struct fdtable *fdt;
> > +	int *fds = NULL;
> > +	int i = 0, n = 0;
> > +	int tot = CKPT_DEFAULT_FDTABLE;
> > +
> > +	/*
> > +	 * We assume that all tasks possibly sharing the file table are
> > +	 * frozen (or we are a single process and we checkpoint ourselves).
> > +	 * Therefore, we can safely proceed after krealloc() from where we
> > +	 * left off. Otherwise the file table may be modified by another
> > +	 * task after we scan it. The behavior in this case is undefined,
> > +	 * and either checkpoint or restart will likely fail.
> > +	 */
> > + retry:
> > +	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
> > +	if (!fds)
> > +		return -ENOMEM;
> > +
> > +	rcu_read_lock();
> > +	fdt = files_fdtable(files);
> > +	for (/**/; i < fdt->max_fds; i++) {
> > +		if (!fcheck_files(files, i))
> > +			continue;
> > +		if (n == tot) {
> > +			rcu_read_unlock();
> > +			tot *= 2;	/* won't overflow: kmalloc will fail */
> > +			goto retry;
> > +		}
> > +		fds[n++] = i;
> > +	}
> > +	rcu_read_unlock();
>
> ...
>
> > +static int checkpoint_file_desc(struct ckpt_ctx *ctx,
> > +				struct files_struct *files, int fd)
> > +{
> > +	struct ckpt_hdr_file_desc *h;
> > +	struct file *file = NULL;
> > +	struct fdtable *fdt;
> > +	int objref, ret;
> > +	int coe = 0;	/* avoid gcc warning */
> > +	pid_t pid;
> > +
> > +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
> > +	if (!h)
> > +		return -ENOMEM;
> > +
> > +	rcu_read_lock();
> > +	fdt = files_fdtable(files);
> > +	file = fcheck_files(files, fd);
> > +	if (file) {
> > +		coe = FD_ISSET(fd, fdt->close_on_exec);
> > +		get_file(file);
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	ret = find_locks_with_owner(file, files);
> > +	/*
> > +	 * find_locks_with_owner() returns an error when there
> > +	 * are no locks found, so we *want* it to return an error
> > +	 * code. Its success means we have to fail the checkpoint.
> > +	 */
> > +	if (!ret) {
> > +		ret = -EBADF;
> > +		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
> > +		goto out;
> > +	}
> > +
> > +	/* sanity check (although this shouldn't happen) */
> > +	ret = -EBADF;
> > +	if (!file) {
> > +		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * TODO: Implement c/r of fowner and f_sigio. Should be
> > +	 * trivial, but for now we just refuse its checkpoint
> > +	 */
> > +	pid = f_getown(file);
> > +	if (pid) {
> > +		ret = -EBUSY;
> > +		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * if seen first time, this will add 'file' to the objhash, keep
> > +	 * a reference to it, dump its state while at it.
> > +	 */
>
> All these kinds of things (including above hunks) IMO are nasty to put
> outside fs/. It would be nice to see higher level functionality
> implemented in fs and exported to your checkpoint stuff.

Agreed. I posted a series of patches that reorganized the non-filesystem
checkpoint/restart code by distributing it to more appropriate places.
If you can stomach web interfaces have a look at:

http://thread.gmane.org/gmane.linux.kernel.containers/16617

It'll take some effort to reorganize and retest ckpt-v20 as I did for
v19. Then I've got to do the same for the filesystem portions. I think
that would complete the reorganization.

> Apparently it's hard because checkpointing is so incestuous with
> everything, but that's why it's important to structure the code well.

You're saying it's difficult to organize because it's got to work with
quite a few disparate VFS structures? My impression is the code breaks
down pretty well along existing lines (fds, fd table, struct files...).
The main problems are resolving the effects of CONFIG_CHECKPOINT=n and
header inclusion messes.

Cheers,
-Matt Helsley

^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
2010-03-22 13:22 ` Matt Helsley
@ 2010-03-22 13:38 ` Nick Piggin
[not found] ` <20100322132232.GD20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
1 sibling, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 13:38 UTC (permalink / raw)
To: Matt Helsley; +Cc: Oren Laadan, linux-fsdevel, containers, Andreas Dilger

On Mon, Mar 22, 2010 at 06:22:32AM -0700, Matt Helsley wrote:
> On Mon, Mar 22, 2010 at 09:30:35PM +1100, Nick Piggin wrote:
> > On Thu, Mar 18, 2010 at 08:59:47PM -0400, Oren Laadan wrote:
> > > +	/*
> > > +	 * if seen first time, this will add 'file' to the objhash, keep
> > > +	 * a reference to it, dump its state while at it.
> > > +	 */
> >
> > All these kinds of things (including above hunks) IMO are nasty to put
> > outside fs/. It would be nice to see higher level functionality
> > implemented in fs and exported to your checkpoint stuff.
>
> Agreed. I posted a series of patches that reorganized the non-filesystem
> checkpoint/restart code by distributing it to more appropriate places.
> If you can stomach web interfaces have a look at:
>
> http://thread.gmane.org/gmane.linux.kernel.containers/16617
>
> It'll take some effort to reorganize and retest ckpt-v20 as I did for
> v19. Then I've got to do the same for the filesystem portions. I think
> that would complete the reorganization.

It may get easier for fs people to review because they won't have to
wade through as much checkpoint code.

> > Apparently it's hard because checkpointing is so incestuous with
> > everything, but that's why it's important to structure the code well.
>
> You're saying it's difficult to organize because it's got to work with
> quite a few disparate VFS structures? My impression is the code breaks
> down pretty well along existing lines (fds, fd table, struct files...).

It is that you are poking inside the internals of the vfs from your
module. This isn't liked because now any changes to vfs have to be
done with an eye to checkpoint/ code. If you can instead implement the
required higher level functionality in fs then it is easier to ensure
that it is correct and that the interfaces are used correctly.

> The main problems are resolving the effects of CONFIG_CHECKPOINT=n and
> header inclusion messes.

That's your main problem? #ifdefs are discouraged, but not when avoiding
them makes the whole structure of the code more convoluted. ifdefs in an
_ops structure initializer, for example, aren't a big problem.

^ permalink raw reply [flat|nested] 88+ messages in thread
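One reading of Nick's remark about ifdefs in an _ops structure initializer is sketched below in stand-alone C. Everything here is hypothetical: `struct demo_file_ops`, its members and `demo_can_checkpoint()` are invented for illustration and do not correspond to a real kernel structure; only the CONFIG_CHECKPOINT symbol comes from the thread.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ctx;	/* opaque stand-in for the checkpoint context */

/* Hypothetical ops table with an optional checkpoint method. */
struct demo_file_ops {
	long (*read)(void);
#ifdef CONFIG_CHECKPOINT
	int (*checkpoint)(struct demo_ctx *ctx);
#endif
};

static long demo_read(void)
{
	return 0;
}

#ifdef CONFIG_CHECKPOINT
static int demo_checkpoint(struct demo_ctx *ctx)
{
	(void)ctx;
	return 0;
}
#endif

/* The #ifdef sits in the initializer, next to the method it guards. */
const struct demo_file_ops demo_fops = {
	.read = demo_read,
#ifdef CONFIG_CHECKPOINT
	.checkpoint = demo_checkpoint,
#endif
};

/* Returns 1 if the ops table provides a checkpoint method, else 0. */
int demo_can_checkpoint(const struct demo_file_ops *ops)
{
#ifdef CONFIG_CHECKPOINT
	return ops->checkpoint != NULL;
#else
	(void)ops;
	return 0;
#endif
}
```

Callers go through `demo_can_checkpoint()` and never mention the config symbol themselves, so the conditional compilation stays confined to the structure definition, its initializer and one accessor.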
* [C/R v20][PATCH 39/96] c/r: restore open file descriptors
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (2 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
` (13 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

For each fd, read 'struct ckpt_hdr_file_desc' and look up objref in the
hash table. If it is not found in the hash table (first occurrence), read
in 'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise attach the file pointer from the hash as an FD.

Changelog[v19-rc1]:
  - Fix lockdep complaint in restore_obj_files()
Changelog[v19-rc1]:
  - Restore thread/cpu state early
  - Ensure null-termination of file names read from image
  - Fix compile warning in restore_open_fname()
Changelog[v18]:
  - Invoke set_close_on_exec() unconditionally on restart
Changelog[v17]:
  - Validate f_mode after restore against saved f_mode
  - Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename: restore_files() => restore_fd_table()
  - Rename: ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching
    ckpt_hbuf_put() (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c         |  318 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c       |    2 +
 checkpoint/process.c       |   20 +++
 include/linux/checkpoint.h |    7 +
 4 files changed, 347 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 7a57b24..b404c8f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -442,3 +444,319 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+	struct file *file;
+	char *fname;
+	int len;
+
+	/* prevent bad input from doing bad things */
+	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+		return ERR_PTR(-EINVAL);
+
+	len = ckpt_read_payload(ctx, (void **) &fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return ERR_PTR(len);
+	fname[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+	file = filp_open(fname, flags, 0);
+	kfree(fname);
+
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	fmode_t new_mode = file->f_mode;
+	fmode_t saved_mode = (__force fmode_t) h->f_mode;
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Normally f_mode is set by open, and modified only via
+	 * fcntl(), so its value now should match that at checkpoint.
+	 * However, a file may be downgraded from (read-)write to
+	 * read-only, e.g:
+	 *  - mark_files_ro() unsets FMODE_WRITE
+	 *  - nfs4_file_downgrade() too, and also sets FMODE_READ
+	 * Validate the new f_mode against saved f_mode, allowing:
+	 *  - new with FMODE_WRITE, saved without FMODE_WRITE
+	 *  - new without FMODE_READ, saved with FMODE_READ
+	 */
+	if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+		new_mode &= ~FMODE_WRITE;
+		if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+			new_mode |= FMODE_READ;
+	}
+	/* finally, at this point new mode should match saved mode */
+	if (new_mode ^ saved_mode)
+		return -EINVAL;
+
+	if (file->f_mode & FMODE_LSEEK)
+		ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = restore_open_fname(ctx, ptr->f_flags);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		   h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+/* restore callback for file pointer */
+void *restore_file(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file(ctx);
+}
+
+/**
+ * restore_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		   h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* restore callback for file table */
+static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_table *h;
+	struct files_struct *files;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (IS_ERR(h))
+		return (struct files_struct *) h;
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/*
+	 * We assume that restarting tasks, as created in user-space,
+	 * have distinct files_struct objects each. If not, we need to
+	 * call dup_fd() to make sure we don't overwrite an already
+	 * restored one.
+ */ + + /* point of no return -- close all file descriptors */ + ret = close_all_fds(current->files); + if (ret < 0) + goto out; + + for (i = 0; i < h->fdt_nfds; i++) { + ret = restore_file_desc(ctx); + if (ret < 0) + goto out; + } + + ret = deferqueue_run(ctx->files_deferq); + ckpt_debug("files_deferq ran %d entries\n", ret); + if (ret > 0) + ret = 0; + out: + ckpt_hdr_put(ctx, h); + if (!ret) { + files = current->files; + atomic_inc(&files->count); + } else { + files = ERR_PTR(ret); + } + return files; +} + +void *restore_file_table(struct ckpt_ctx *ctx) +{ + return (void *) do_restore_file_table(ctx); +} + +int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref) +{ + struct files_struct *files; + + files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE); + if (IS_ERR(files)) + return PTR_ERR(files); + + if (files != current->files) { + struct files_struct *prev; + + task_lock(current); + prev = current->files; + current->files = files; + atomic_inc(&files->count); + task_unlock(current); + + put_files_struct(prev); + } + + return 0; +} diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index f25d130..cacc4c7 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -112,6 +112,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_grab = obj_file_table_grab, .ref_users = obj_file_table_users, .checkpoint = checkpoint_file_table, + .restore = restore_file_table, }, /* file object */ { @@ -121,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_grab = obj_file_grab, .ref_users = obj_file_users, .checkpoint = checkpoint_file, + .restore = restore_file, }, }; diff --git a/checkpoint/process.c b/checkpoint/process.c index adc34a2..23e0296 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -348,6 +348,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx) return ret; } +static int restore_task_objs(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_task_objs *h; + int ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), 
CKPT_HDR_TASK_OBJS); + if (IS_ERR(h)) + return PTR_ERR(h); + + ret = restore_obj_file_table(ctx, h->files_objref); + ckpt_debug("file_table: ret %d (%p)\n", ret, current->files); + + ckpt_hdr_put(ctx, h); + return ret; +} + int restore_restart_block(struct ckpt_ctx *ctx) { struct ckpt_hdr_restart_block *h; @@ -477,6 +493,10 @@ int restore_task(struct ckpt_ctx *ctx) goto out; ret = restore_cpu(ctx); ckpt_debug("cpu %d\n", ret); + if (ret < 0) + goto out; + ret = restore_task_objs(ctx); + ckpt_debug("objs %d\n", ret); out: return ret; } diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index d74a890..749f30c 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -163,16 +163,23 @@ extern int restore_restart_block(struct ckpt_ctx *ctx); extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t); extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t); +extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref); extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr); +extern void *restore_file_table(struct ckpt_ctx *ctx); /* files */ extern int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root); +extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags); + extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file); extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr); +extern void *restore_file(struct ckpt_ctx *ctx); extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, struct ckpt_hdr_file *h); +extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h); static inline int ckpt_validate_errno(int errno) { -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
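The f_mode check in restore_file_common() above can be tried outside the
kernel. What follows is a minimal userspace sketch under stated assumptions,
not the kernel code: SK_FMODE_READ, SK_FMODE_WRITE and sketch_validate_mode()
are hypothetical stand-ins for FMODE_READ, FMODE_WRITE and the inline check,
and the bit values are made up for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's FMODE_READ / FMODE_WRITE bits. */
#define SK_FMODE_READ	0x1u
#define SK_FMODE_WRITE	0x2u

/*
 * Compare the mode of a freshly reopened file (new_mode) with the mode
 * recorded at checkpoint (saved_mode). A reopened file that still has
 * write permission is accepted when the saved mode had been downgraded
 * to read-only; any other difference is rejected.
 */
static int sketch_validate_mode(unsigned int new_mode, unsigned int saved_mode)
{
	if ((new_mode & SK_FMODE_WRITE) && !(saved_mode & SK_FMODE_WRITE)) {
		/* saved file lost write access after open: drop it here too */
		new_mode &= ~SK_FMODE_WRITE;
		/* the downgrade may also have set read access */
		if (!(new_mode & SK_FMODE_READ) && (saved_mode & SK_FMODE_READ))
			new_mode |= SK_FMODE_READ;
	}
	/* after the adjustment the two modes must be identical */
	return (new_mode ^ saved_mode) ? -EINVAL : 0;
}
```

With these definitions, a file reopened read-write against a saved read-only
mode is accepted, while a file reopened read-only against a saved read-write
mode is rejected with -EINVAL, since the check only tolerates downgrades that
happened before the checkpoint.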
* [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (3 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
` (12 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

Changelog[v17]
  - Forward-declare 'ckpt_ctx' et al., don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/mm.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60c467b..48d67ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct rlimit;
+struct ckpt_ctx;
 
 #ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -220,6 +221,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
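The pattern in this patch, an optional method that exists in the ops table
only when a config option is set and that needs just a forward declaration of
the context type, can be sketched in plain C. Everything below is hypothetical
and for illustration only: SKETCH_CONFIG_CHECKPOINT stands in for
CONFIG_CHECKPOINT, the sketch_* types stand in for ckpt_ctx, vm_area_struct
and vm_operations_struct, and returning -ENOSYS for a missing method is an
assumption, since the caller of '->checkpoint()' is not part of this message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_CHECKPOINT. */
#define SKETCH_CONFIG_CHECKPOINT 1

/* Forward declarations suffice: the ops table only stores pointers. */
struct sketch_ctx;
struct sketch_vma;

struct sketch_vm_ops {
	int (*fault)(struct sketch_vma *vma);
#ifdef SKETCH_CONFIG_CHECKPOINT
	int (*checkpoint)(struct sketch_ctx *ctx, struct sketch_vma *vma);
#endif
};

struct sketch_ctx {
	int vmas_saved;		/* counts areas handed to a checkpoint method */
};

struct sketch_vma {
	const struct sketch_vm_ops *ops;
};

/* Example method: only records that the area was visited. */
static int sketch_filemap_checkpoint(struct sketch_ctx *ctx,
				     struct sketch_vma *vma)
{
	(void)vma;
	ctx->vmas_saved++;
	return 0;
}

/* An ops table that opts in to checkpointing. */
static const struct sketch_vm_ops sketch_ops_with = {
	.fault = NULL,
#ifdef SKETCH_CONFIG_CHECKPOINT
	.checkpoint = sketch_filemap_checkpoint,
#endif
};

/* An ops table that does not; the unset member is NULL. */
static const struct sketch_vm_ops sketch_ops_without = {
	.fault = NULL,
};

/* Call the method if present; assumed error code when it is absent. */
static int sketch_checkpoint_vma(struct sketch_ctx *ctx, struct sketch_vma *vma)
{
#ifdef SKETCH_CONFIG_CHECKPOINT
	if (vma->ops && vma->ops->checkpoint)
		return vma->ops->checkpoint(ctx, vma);
#endif
	return -ENOSYS;
}
```

Because members left out of a designated initializer are zero-initialized,
an ops table that never mentions '.checkpoint' simply carries a NULL pointer,
which is what lets existing tables compile unchanged after the field is added.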
* [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (4 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
` (11 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel
Cc: containers, Matt Helsley, Andreas Dilger, Dave Hansen, Oren Laadan

From: Dave Hansen <dave@linux.vnet.ibm.com>

This marks ext[234] as being checkpointable. There will be many
more to do this to, but this is a start.

Changelog[ckpt-v19-rc3]:
  - Rebase to kernel 2.6.33 (ext2)
Changelog[v1]:
  - [Serge Hallyn] Use filemap_checkpoint() in ext4_file_vm_ops

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    4 ++++
 6 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..84c17f9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl = ext2_compat_ioctl,
 #endif
 	.fsync = ext2_fsync,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 586e358..b38d7b9 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -75,6 +75,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync = ext2_fsync,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
+	.checkpoint = generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -90,6 +91,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open = generic_file_open,
 	.release = ext2_release_file,
 	.fsync = ext2_fsync,
+	.checkpoint = generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 373fa90..65f98af 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync = ext3_sync_file, /* BKL held */
 	.release = ext3_release_dir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 388bbdf..bcd9b88 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -67,6 +67,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync = ext3_sync_file,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9dc9316..f69404c 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync = ext4_sync_file,
 	.release = ext4_release_dir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9630583..93a129b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -84,6 +84,9 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault = filemap_fault,
 	.page_mkwrite = ext4_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = filemap_checkpoint,
+#endif
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -146,6 +149,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync = ext4_sync_file,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
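The body of generic_file_checkpoint() is not part of this message. As a rough
userspace analogy only, assuming the generic operation records the per-file
state that restore_file_common() in patch 39 later consumes (flags and
position), the round trip might look like the sketch below. The names
sketch_file_state, sketch_save(), sketch_restore() and SKETCH_SETFL_MASK are
hypothetical; the mask is a small userspace subset in the spirit of
CKPT_SETFL_MASK, not the kernel definition.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical subset of the flags that F_SETFL is allowed to change. */
#define SKETCH_SETFL_MASK (O_APPEND | O_NONBLOCK)

/* State captured for one open, seekable file. */
struct sketch_file_state {
	int flags;	/* from fcntl(F_GETFL) */
	off_t pos;	/* from lseek(SEEK_CUR) */
};

/* Record flags and current offset; returns 0 on success, -1 on error. */
static int sketch_save(int fd, struct sketch_file_state *st)
{
	st->flags = fcntl(fd, F_GETFL);
	if (st->flags < 0)
		return -1;
	st->pos = lseek(fd, 0, SEEK_CUR);
	return st->pos == (off_t)-1 ? -1 : 0;
}

/* Reapply the maskable flags and seek back to the recorded offset. */
static int sketch_restore(int fd, const struct sketch_file_state *st)
{
	if (fcntl(fd, F_SETFL, st->flags & SKETCH_SETFL_MASK) < 0)
		return -1;
	return lseek(fd, st->pos, SEEK_SET) == (off_t)-1 ? -1 : 0;
}
```

The sketch depends on the file being seekable, which is why it only makes
sense for files whose position can be read back and set again with lseek().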
* [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (5 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
` (10 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 drivers/char/mem.c    |    2 ++
 drivers/char/random.c |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 48788db..57e3443 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -763,6 +763,7 @@ static const struct file_operations null_fops = {
 	.read = read_null,
 	.write = write_null,
 	.splice_write = splice_write_null,
+	.checkpoint = generic_file_checkpoint,
 };
 
 #ifdef CONFIG_DEVPORT
@@ -779,6 +780,7 @@ static const struct file_operations zero_fops = {
 	.read = read_zero,
 	.write = write_zero,
 	.mmap = mmap_zero,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2849713..c082789 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1169,6 +1169,7 @@ const struct file_operations random_fops = {
 	.poll = random_poll,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations urandom_fops = {
@@ -1176,6 +1177,7 @@ const struct file_operations urandom_fops = {
 	.write = random_write,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /***************************************************************
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (6 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
` (9 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger

From: Matt Helsley <matthltc@us.ibm.com>

These patches extend the use of the generic file checkpoint operation to
non-extX filesystems which have lseek operations that ensure we can save
and restore the files for later use. Note that this does not include
things like FUSE, network filesystems, or pseudo-filesystem kernel
interfaces.

Only compile and boot tested (on x86-32).

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

 [PATCH 01/36] Add the checkpoint operation for affs files and directories.
 [PATCH 02/36] Add the checkpoint operation for befs directories.
 [PATCH 03/36] Add the checkpoint operation for bfs files and directories.
 [PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
 [PATCH 05/36] Add the checkpoint operation for cramfs directories.
 [PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
 [PATCH 07/36] Add the checkpoint operation for fat files and directories.
 [PATCH 08/36] Add the checkpoint operation for freevxfs directories.
 [PATCH 09/36] Add the checkpoint operation for hfs files and directories.
 [PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
 [PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
 [PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
 [PATCH 13/36] Add the checkpoint operation for iso directories.
 [PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
 [PATCH 15/36] Add the checkpoint operation for jfs files and directories.
 [PATCH 16/36] Add the checkpoint operation for regular nfs files and
               directories. Skip the various /proc files for now.
 [PATCH 17/36] Add the checkpoint operation for ntfs directories.
 [PATCH 18/36] Add the checkpoint operation for openpromfs directories.
               Explicitly skip the properties for now.
 [PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
 [PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
 [PATCH 21/36] Add the checkpoint operation for romfs directories.
 [PATCH 22/36] Add the checkpoint operation for squashfs directories.
 [PATCH 23/36] Add the checkpoint operation for sysv filesystem files and
               directories.
 [PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
 [PATCH 25/36] Add the checkpoint operation for udf filesystem files and
               directories.
 [PATCH 26/36] Add the checkpoint operation for xfs files and directories.
 [PATCH 27/36] Add checkpoint operation for efs directories.
 [PATCH 28/36] Add the checkpoint operation for generic, read-only files.
               At present, some/all files of the following filesystems use
               this generic definition:
 [PATCH 29/36] Add checkpoint operation for minix filesystem files and
               directories.
 [PATCH 30/36] Add checkpoint operations for omfs files and directories.
 [PATCH 31/36] Add checkpoint operations for ufs files and directories.
 [PATCH 32/36] Add checkpoint operations for ramfs files. NOTE: since
               simple_dir_operations are shared between multiple filesystems
               including ramfs, it's not currently possible to checkpoint
               open ramfs directories.
 [PATCH 33/36] Add the checkpoint operation for adfs files and directories.
 [PATCH 34/36] Add the checkpoint operation to exofs files and directories.
 [PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
 [PATCH 36/36] Add checkpoint operations for UML host filesystem files and
               directories.

Changelog[v19-rc3]:
  - [Suka] Enable C/R while executing over NFS

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/adfs/dir.c               |    1 +
 fs/adfs/file.c              |    1 +
 fs/affs/dir.c               |    1 +
 fs/affs/file.c              |    1 +
 fs/befs/linuxvfs.c          |    1 +
 fs/bfs/dir.c                |    1 +
 fs/bfs/file.c               |    1 +
 fs/btrfs/file.c             |    1 +
 fs/btrfs/inode.c            |    1 +
 fs/btrfs/super.c            |    1 +
 fs/cramfs/inode.c           |    1 +
 fs/ecryptfs/file.c          |    2 ++
 fs/ecryptfs/miscdev.c       |    1 +
 fs/efs/dir.c                |    1 +
 fs/exofs/dir.c              |    1 +
 fs/exofs/file.c             |    1 +
 fs/fat/dir.c                |    1 +
 fs/fat/file.c               |    1 +
 fs/freevxfs/vxfs_lookup.c   |    1 +
 fs/hfs/dir.c                |    1 +
 fs/hfs/inode.c              |    1 +
 fs/hfsplus/dir.c            |    1 +
 fs/hfsplus/inode.c          |    1 +
 fs/hostfs/hostfs_kern.c     |    2 ++
 fs/hpfs/dir.c               |    1 +
 fs/hpfs/file.c              |    1 +
 fs/hppfs/hppfs.c            |    2 ++
 fs/isofs/dir.c              |    1 +
 fs/jffs2/dir.c              |    1 +
 fs/jffs2/file.c             |    1 +
 fs/jfs/file.c               |    1 +
 fs/jfs/namei.c              |    1 +
 fs/minix/dir.c              |    1 +
 fs/minix/file.c             |    1 +
 fs/nfs/dir.c                |    1 +
 fs/nfs/file.c               |    4 ++++
 fs/nilfs2/dir.c             |    2 +-
 fs/nilfs2/file.c            |    1 +
 fs/ntfs/dir.c               |    1 +
 fs/ntfs/file.c              |    3 ++-
 fs/omfs/dir.c               |    1 +
 fs/omfs/file.c              |    1 +
 fs/openpromfs/inode.c       |    2 ++
 fs/qnx4/dir.c               |    1 +
 fs/ramfs/file-mmu.c         |    1 +
 fs/ramfs/file-nommu.c       |    1 +
 fs/read_write.c             |    1 +
 fs/reiserfs/dir.c           |    1 +
 fs/reiserfs/file.c          |    1 +
 fs/romfs/mmap-nommu.c       |    1 +
 fs/romfs/super.c            |    1 +
 fs/squashfs/dir.c           |    3 ++-
 fs/sysv/dir.c               |    1 +
 fs/sysv/file.c              |    1 +
 fs/ubifs/debug.c            |    1 +
 fs/ubifs/dir.c              |    1 +
 fs/ubifs/file.c             |    1 +
 fs/udf/dir.c                |    1 +
 fs/udf/file.c               |    1 +
 fs/ufs/dir.c                |    1 +
 fs/ufs/file.c               |    1 +
 fs/xfs/linux-2.6/xfs_file.c |    2 ++
 62 files changed, 72 insertions(+), 3 deletions(-)

diff --git
a/fs/adfs/dir.c b/fs/adfs/dir.c index 23aa52f..7106f32 100644 --- a/fs/adfs/dir.c +++ b/fs/adfs/dir.c @@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = { .llseek = generic_file_llseek, .readdir = adfs_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; static int diff --git a/fs/adfs/file.c b/fs/adfs/file.c index 005ea34..97bd298 100644 --- a/fs/adfs/file.c +++ b/fs/adfs/file.c @@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = { .write = do_sync_write, .aio_write = generic_file_aio_write, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations adfs_file_inode_operations = { diff --git a/fs/affs/dir.c b/fs/affs/dir.c index 8ca8f3a..6cc5e43 100644 --- a/fs/affs/dir.c +++ b/fs/affs/dir.c @@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = { .llseek = generic_file_llseek, .readdir = affs_readdir, .fsync = affs_file_fsync, + .checkpoint = generic_file_checkpoint, }; /* diff --git a/fs/affs/file.c b/fs/affs/file.c index 184e55c..d580a12 100644 --- a/fs/affs/file.c +++ b/fs/affs/file.c @@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = { .release = affs_file_release, .fsync = affs_file_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations affs_file_inode_operations = { diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c index 34ddda8..b97f79b 100644 --- a/fs/befs/linuxvfs.c +++ b/fs/befs/linuxvfs.c @@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = { .read = generic_read_dir, .readdir = befs_readdir, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations befs_dir_inode_operations = { diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c index 1e41aad..d78015e 100644 --- a/fs/bfs/dir.c +++ b/fs/bfs/dir.c @@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = { 
.readdir = bfs_readdir, .fsync = simple_fsync, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; extern void dump_imap(const char *, struct super_block *); diff --git a/fs/bfs/file.c b/fs/bfs/file.c index 88b9a3f..7f61ed6 100644 --- a/fs/bfs/file.c +++ b/fs/bfs/file.c @@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = { .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; static int bfs_move_block(unsigned long from, unsigned long to, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 6ed434a..281a2b8 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1164,4 +1164,5 @@ const struct file_operations btrfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = btrfs_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4deb280..606c31d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5971,6 +5971,7 @@ static const struct file_operations btrfs_dir_file_operations = { #endif .release = btrfs_release_file, .fsync = btrfs_sync_file, + .checkpoint = generic_file_checkpoint, }; static struct extent_io_ops btrfs_extent_io_ops = { diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 8a1ea6e..7a28ac5 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -718,6 +718,7 @@ static const struct file_operations btrfs_ctl_fops = { .unlocked_ioctl = btrfs_control_ioctl, .compat_ioctl = btrfs_control_ioctl, .owner = THIS_MODULE, + .checkpoint = generic_file_checkpoint, }; static struct miscdevice btrfs_misc = { diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c index dd3634e..0927503 100644 --- a/fs/cramfs/inode.c +++ b/fs/cramfs/inode.c @@ -532,6 +532,7 @@ static const struct file_operations cramfs_directory_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .readdir = cramfs_readdir, + .checkpoint = generic_file_checkpoint, }; static const struct 
inode_operations cramfs_dir_inode_operations = { diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c index 678172b..a8973ef 100644 --- a/fs/ecryptfs/file.c +++ b/fs/ecryptfs/file.c @@ -305,6 +305,7 @@ const struct file_operations ecryptfs_dir_fops = { .fsync = ecryptfs_fsync, .fasync = ecryptfs_fasync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct file_operations ecryptfs_main_fops = { @@ -322,6 +323,7 @@ const struct file_operations ecryptfs_main_fops = { .fsync = ecryptfs_fsync, .fasync = ecryptfs_fasync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; static int diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c index 4ec8f61..9fd9b39 100644 --- a/fs/ecryptfs/miscdev.c +++ b/fs/ecryptfs/miscdev.c @@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = { .read = ecryptfs_miscdev_read, .write = ecryptfs_miscdev_write, .release = ecryptfs_miscdev_release, + .checkpoint = generic_file_checkpoint, }; static struct miscdevice ecryptfs_miscdev = { diff --git a/fs/efs/dir.c b/fs/efs/dir.c index 7ee6f7e..da344b8 100644 --- a/fs/efs/dir.c +++ b/fs/efs/dir.c @@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .readdir = efs_readdir, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations efs_dir_inode_operations = { diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c index 4cfab1c..f6693d3 100644 --- a/fs/exofs/dir.c +++ b/fs/exofs/dir.c @@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .readdir = exofs_readdir, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/exofs/file.c b/fs/exofs/file.c index 839b9dc..257e9da 100644 --- a/fs/exofs/file.c +++ b/fs/exofs/file.c @@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id) const struct file_operations 
exofs_file_operations = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .read = do_sync_read, .write = do_sync_write, .aio_read = generic_file_aio_read, diff --git a/fs/fat/dir.c b/fs/fat/dir.c index 530b4ca..e3fa353 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = { .compat_ioctl = fat_compat_dir_ioctl, #endif .fsync = fat_file_fsync, + .checkpoint = generic_file_checkpoint, }; static int fat_get_short_entry(struct inode *dir, loff_t *pos, diff --git a/fs/fat/file.c b/fs/fat/file.c index e8c159d..e5aecc6 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = { .ioctl = fat_generic_ioctl, .fsync = fat_file_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; static int fat_cont_expand(struct inode *inode, loff_t size) diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c index aee049c..3a09132 100644 --- a/fs/freevxfs/vxfs_lookup.c +++ b/fs/freevxfs/vxfs_lookup.c @@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = { const struct file_operations vxfs_dir_operations = { .readdir = vxfs_readdir, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c index 2b3b861..0eef6c2 100644 --- a/fs/hfs/dir.c +++ b/fs/hfs/dir.c @@ -329,6 +329,7 @@ const struct file_operations hfs_dir_operations = { .readdir = hfs_readdir, .llseek = generic_file_llseek, .release = hfs_dir_release, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations hfs_dir_inode_operations = { diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index a1cbff2..bf8950f 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -607,6 +607,7 @@ static const struct file_operations hfs_file_operations = { .fsync = file_fsync, .open = hfs_file_open, .release = hfs_file_release, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations 
hfs_file_inode_operations = { diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c index 5f40236..41fbf2d 100644 --- a/fs/hfsplus/dir.c +++ b/fs/hfsplus/dir.c @@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = { .ioctl = hfsplus_ioctl, .llseek = generic_file_llseek, .release = hfsplus_dir_release, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 1bcf597..19abd7e 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = { .open = hfsplus_file_open, .release = hfsplus_file_release, .ioctl = hfsplus_ioctl, + .checkpoint = generic_file_checkpoint, }; struct inode *hfsplus_new_inode(struct super_block *sb, int mode) diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c index 032604e..67e2356 100644 --- a/fs/hostfs/hostfs_kern.c +++ b/fs/hostfs/hostfs_kern.c @@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync) static const struct file_operations hostfs_file_fops = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .read = do_sync_read, .splice_read = generic_file_splice_read, .aio_read = generic_file_aio_read, @@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = { static const struct file_operations hostfs_dir_fops = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .readdir = hostfs_readdir, .read = generic_read_dir, }; diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c index 8865c94..dcde10f 100644 --- a/fs/hpfs/dir.c +++ b/fs/hpfs/dir.c @@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops = .readdir = hpfs_readdir, .release = hpfs_dir_release, .fsync = hpfs_file_fsync, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c index 3efabff..f1211f0 100644 --- a/fs/hpfs/file.c +++ b/fs/hpfs/file.c @@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops = .release 
= hpfs_file_release, .fsync = hpfs_file_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations hpfs_file_iops = diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c index 7239efc..e3c3bd3 100644 --- a/fs/hppfs/hppfs.c +++ b/fs/hppfs/hppfs.c @@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = { .read = hppfs_read, .write = hppfs_write, .open = hppfs_open, + .checkpoint = generic_file_checkpoint, }; struct hppfs_dirent { @@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = { .readdir = hppfs_readdir, .open = hppfs_dir_open, .fsync = hppfs_fsync, + .checkpoint = generic_file_checkpoint, }; static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf) diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c index 8ba5441..848059d 100644 --- a/fs/isofs/dir.c +++ b/fs/isofs/dir.c @@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations = { .read = generic_read_dir, .readdir = isofs_readdir, + .checkpoint = generic_file_checkpoint, }; /* diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index 7aa4417..c7c4dcb 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations = .unlocked_ioctl=jffs2_ioctl, .fsync = jffs2_fsync, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c index b7b74e2..f01038d 100644 --- a/fs/jffs2/file.c +++ b/fs/jffs2/file.c @@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations = .mmap = generic_file_readonly_mmap, .fsync = jffs2_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; /* jffs2_file_inode_operations */ diff --git a/fs/jfs/file.c b/fs/jfs/file.c index 2b70fa7..3bd7114 100644 --- a/fs/jfs/file.c +++ b/fs/jfs/file.c @@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = jfs_compat_ioctl, #endif + .checkpoint 
= generic_file_checkpoint, }; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index c79a427..585a7d2 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = { .compat_ioctl = jfs_compat_ioctl, #endif .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; static int jfs_ci_hash(struct dentry *dir, struct qstr *this) diff --git a/fs/minix/dir.c b/fs/minix/dir.c index 6198731..74b6fb4 100644 --- a/fs/minix/dir.c +++ b/fs/minix/dir.c @@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = { .read = generic_read_dir, .readdir = minix_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; static inline void dir_put_page(struct page *page) diff --git a/fs/minix/file.c b/fs/minix/file.c index 3eec3e6..2048d09 100644 --- a/fs/minix/file.c +++ b/fs/minix/file.c @@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = { .mmap = generic_file_mmap, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations minix_file_inode_operations = { diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 3c7f03b..7d9d22a 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = { .open = nfs_opendir, .release = nfs_release, .fsync = nfs_fsync_dir, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations nfs_dir_inode_operations = { diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 63f2071..4437ef9 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = { .splice_write = nfs_file_splice_write, .check_flags = nfs_check_flags, .setlease = nfs_setlease, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations nfs_file_inode_operations = { @@ -577,6 +578,9 @@ out_unlock: static const struct vm_operations_struct nfs_file_vm_ops = { .fault = 
filemap_fault, .page_mkwrite = nfs_vm_page_mkwrite, +#ifdef CONFIG_CHECKPOINT + .checkpoint = filemap_checkpoint, +#endif }; static int nfs_need_sync_write(struct file *filp, struct inode *inode) diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c index 76d803e..18b2171 100644 --- a/fs/nilfs2/dir.c +++ b/fs/nilfs2/dir.c @@ -702,5 +702,5 @@ const struct file_operations nilfs_dir_operations = { .compat_ioctl = nilfs_ioctl, #endif /* CONFIG_COMPAT */ .fsync = nilfs_sync_file, - + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c index 30292df..4d585b5 100644 --- a/fs/nilfs2/file.c +++ b/fs/nilfs2/file.c @@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma) */ const struct file_operations nilfs_file_operations = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .read = do_sync_read, .write = do_sync_write, .aio_read = generic_file_aio_read, diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c index 5a9e344..4fe3759 100644 --- a/fs/ntfs/dir.c +++ b/fs/ntfs/dir.c @@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = { /*.ioctl = ,*/ /* Perform function on the mounted filesystem. */ .open = ntfs_dir_open, /* Open directory. */ + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c index 43179dd..32a43f5 100644 --- a/fs/ntfs/file.c +++ b/fs/ntfs/file.c @@ -2224,7 +2224,7 @@ const struct file_operations ntfs_file_ops = { mounted filesystem. */ .mmap = generic_file_mmap, /* Mmap file. */ .open = ntfs_file_open, /* Open file. */ - .splice_read = generic_file_splice_read /* Zero-copy data send with + .splice_read = generic_file_splice_read, /* Zero-copy data send with the data source being on the ntfs partition. We do not need to care about the @@ -2234,6 +2234,7 @@ const struct file_operations ntfs_file_ops = { on the ntfs partition. We do not need to care about the data source. 
*/ + .checkpoint = generic_file_checkpoint, }; const struct inode_operations ntfs_file_inode_ops = { diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c index b42d624..e924e33 100644 --- a/fs/omfs/dir.c +++ b/fs/omfs/dir.c @@ -502,4 +502,5 @@ const struct file_operations omfs_dir_operations = { .read = generic_read_dir, .readdir = omfs_readdir, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/omfs/file.c b/fs/omfs/file.c index 399487c..83e63ef 100644 --- a/fs/omfs/file.c +++ b/fs/omfs/file.c @@ -331,6 +331,7 @@ const struct file_operations omfs_file_operations = { .mmap = generic_file_mmap, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations omfs_file_inops = { diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c index ffcd04f..d1f0677 100644 --- a/fs/openpromfs/inode.c +++ b/fs/openpromfs/inode.c @@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = { .read = seq_read, .llseek = seq_lseek, .release = seq_release, + .checkpoint = NULL, }; static int openpromfs_readdir(struct file *, void *, filldir_t); @@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = { .read = generic_read_dir, .readdir = openpromfs_readdir, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *); diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c index 6f30c3d..fa14c55 100644 --- a/fs/qnx4/dir.c +++ b/fs/qnx4/dir.c @@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations = .read = generic_read_dir, .readdir = qnx4_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations qnx4_dir_inode_operations = diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c index 78f613c..4430239 100644 --- a/fs/ramfs/file-mmu.c +++ b/fs/ramfs/file-mmu.c @@ -47,6 +47,7 @@ const struct 
file_operations ramfs_file_operations = { .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations ramfs_file_inode_operations = { diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c index 1739a4a..9cd6208 100644 --- a/fs/ramfs/file-nommu.c +++ b/fs/ramfs/file-nommu.c @@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = { .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations ramfs_file_inode_operations = { diff --git a/fs/read_write.c b/fs/read_write.c index e258301..65371e1 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = { .aio_read = generic_file_aio_read, .mmap = generic_file_readonly_mmap, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; EXPORT_SYMBOL(generic_ro_fops); diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c index c094f58..8681419 100644 --- a/fs/reiserfs/dir.c +++ b/fs/reiserfs/dir.c @@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = reiserfs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry, diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c index da2dba0..b6008f3 100644 --- a/fs/reiserfs/file.c +++ b/fs/reiserfs/file.c @@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = { .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations reiserfs_file_inode_operations = { diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c index f0511e8..03c24d9 100644 --- 
a/fs/romfs/mmap-nommu.c +++ b/fs/romfs/mmap-nommu.c @@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = { .splice_read = generic_file_splice_read, .mmap = romfs_mmap, .get_unmapped_area = romfs_get_unmapped_area, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/romfs/super.c b/fs/romfs/super.c index 42d2135..476ea8e 100644 --- a/fs/romfs/super.c +++ b/fs/romfs/super.c @@ -282,6 +282,7 @@ error: static const struct file_operations romfs_dir_operations = { .read = generic_read_dir, .readdir = romfs_readdir, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations romfs_dir_inode_operations = { diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c index 566b0ea..b0c5336 100644 --- a/fs/squashfs/dir.c +++ b/fs/squashfs/dir.c @@ -231,5 +231,6 @@ failed_read: const struct file_operations squashfs_dir_ops = { .read = generic_read_dir, - .readdir = squashfs_readdir + .readdir = squashfs_readdir, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c index 4e50286..53acd29 100644 --- a/fs/sysv/dir.c +++ b/fs/sysv/dir.c @@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = { .read = generic_read_dir, .readdir = sysv_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; static inline void dir_put_page(struct page *page) diff --git a/fs/sysv/file.c b/fs/sysv/file.c index 96340c0..aee556d 100644 --- a/fs/sysv/file.c +++ b/fs/sysv/file.c @@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = { .mmap = generic_file_mmap, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations sysv_file_inode_operations = { diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c index 9049232..e4f23c6 100644 --- a/fs/ubifs/debug.c +++ b/fs/ubifs/debug.c @@ -2623,6 +2623,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf, static const struct file_operations dfs_fops = { 
.open = open_debugfs_file, .write = write_debugfs_file, + .checkpoint = generic_file_checkpoint, .owner = THIS_MODULE, }; diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c index 552fb01..89ab2aa 100644 --- a/fs/ubifs/dir.c +++ b/fs/ubifs/dir.c @@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ubifs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index 16a6444..254a4d9 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1582,4 +1582,5 @@ const struct file_operations ubifs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ubifs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/udf/dir.c b/fs/udf/dir.c index 61d9a76..6586dbe 100644 --- a/fs/udf/dir.c +++ b/fs/udf/dir.c @@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = { .readdir = udf_readdir, .ioctl = udf_ioctl, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/udf/file.c b/fs/udf/file.c index f311d50..e671552 100644 --- a/fs/udf/file.c +++ b/fs/udf/file.c @@ -215,6 +215,7 @@ const struct file_operations udf_file_operations = { .fsync = simple_fsync, .splice_read = generic_file_splice_read, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations udf_file_inode_operations = { diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c index 22af68f..29c9396 100644 --- a/fs/ufs/dir.c +++ b/fs/ufs/dir.c @@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = { .readdir = ufs_readdir, .fsync = simple_fsync, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/ufs/file.c b/fs/ufs/file.c index 73655c6..15c8616 100644 --- a/fs/ufs/file.c +++ b/fs/ufs/file.c @@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = { .open = generic_file_open, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = 
generic_file_checkpoint, }; diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index e4caeb2..926f377 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -259,6 +259,7 @@ const struct file_operations xfs_file_operations = { #ifdef HAVE_FOP_OPEN_EXEC .open_exec = xfs_file_open_exec, #endif + .checkpoint = generic_file_checkpoint, }; const struct file_operations xfs_dir_file_operations = { @@ -271,6 +272,7 @@ const struct file_operations xfs_dir_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .fsync = xfs_file_fsync, + .checkpoint = generic_file_checkpoint, }; static const struct vm_operations_struct xfs_file_vm_ops = { -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality 2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan ` (7 preceding siblings ...) 2010-03-19 0:59 ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan @ 2010-03-19 0:59 ` Oren Laadan [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> ` (8 subsequent siblings) 17 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan During c/r of pipes we need to save and restore pipe buffers. But do_splice() requires two file descriptors, so we can't use it, as we always have one file descriptor (the checkpoint image) and one pipe_inode_info. This patch exports interfaces that work at the pipe_inode_info level, namely link_pipe(), do_splice_to() and do_splice_from(). They are used in the following patch to save and restore pipe buffers without an unnecessary data copy. It slightly modifies both do_splice_to() and do_splice_from() to detect the case of pipe-to-pipe transfer, in which case they invoke splice_pipe_to_pipe() directly. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- fs/splice.c | 61 ++++++++++++++++++++++++++++++++--------------- include/linux/splice.h | 9 +++++++ 2 files changed, 50 insertions(+), 20 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 3920866..76acb55 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1051,18 +1051,43 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out, EXPORT_SYMBOL(generic_splice_sendpage); /* + * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same + * location, so checking ->i_pipe is not enough to verify that this is a + * pipe. 
+ */ +static inline struct pipe_inode_info *pipe_info(struct inode *inode) +{ + if (S_ISFIFO(inode->i_mode)) + return inode->i_pipe; + + return NULL; +} + +static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + size_t len, unsigned int flags); + +/* * Attempt to initiate a splice from pipe to file. */ -static long do_splice_from(struct pipe_inode_info *pipe, struct file *out, - loff_t *ppos, size_t len, unsigned int flags) +long do_splice_from(struct pipe_inode_info *pipe, struct file *out, + loff_t *ppos, size_t len, unsigned int flags) { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); + struct pipe_inode_info *opipe; int ret; if (unlikely(!(out->f_mode & FMODE_WRITE))) return -EBADF; + /* When called directly (e.g. from c/r) output may be a pipe */ + opipe = pipe_info(out->f_path.dentry->d_inode); + if (opipe) { + BUG_ON(opipe == pipe); + return splice_pipe_to_pipe(pipe, opipe, len, flags); + } + if (unlikely(out->f_flags & O_APPEND)) return -EINVAL; @@ -1081,17 +1106,25 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out, /* * Attempt to initiate a splice from a file to a pipe. */ -static long do_splice_to(struct file *in, loff_t *ppos, - struct pipe_inode_info *pipe, size_t len, - unsigned int flags) +long do_splice_to(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) { ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); + struct pipe_inode_info *ipipe; int ret; if (unlikely(!(in->f_mode & FMODE_READ))) return -EBADF; + /* When called directly (e.g. 
from c/r) input may be a pipe */ + ipipe = pipe_info(in->f_path.dentry->d_inode); + if (ipipe) { + BUG_ON(ipipe == pipe); + return splice_pipe_to_pipe(ipipe, pipe, len, flags); + } + ret = rw_verify_area(READ, in, ppos, len); if (unlikely(ret < 0)) return ret; @@ -1271,18 +1304,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, struct pipe_inode_info *opipe, size_t len, unsigned int flags); -/* - * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same - * location, so checking ->i_pipe is not enough to verify that this is a - * pipe. - */ -static inline struct pipe_inode_info *pipe_info(struct inode *inode) -{ - if (S_ISFIFO(inode->i_mode)) - return inode->i_pipe; - - return NULL; -} /* * Determine where to splice to/from. @@ -1887,9 +1908,9 @@ retry: /* * Link contents of ipipe to opipe. */ -static int link_pipe(struct pipe_inode_info *ipipe, - struct pipe_inode_info *opipe, - size_t len, unsigned int flags) +int link_pipe(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + size_t len, unsigned int flags) { struct pipe_buffer *ibuf, *obuf; int ret = 0, i = 0, nbuf; diff --git a/include/linux/splice.h b/include/linux/splice.h index 18e7c7c..431662c 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -82,4 +82,13 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *, extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *, splice_direct_actor *); +extern int link_pipe(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + size_t len, unsigned int flags); +extern long do_splice_to(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags); +extern long do_splice_from(struct pipe_inode_info *pipe, struct file *out, + loff_t *ppos, size_t len, unsigned int flags); + #endif -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
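To illustrate why pipe_info() tests the inode mode rather than the pointer, here is a minimal user-space sketch. It is not kernel code: struct model_inode, struct model_pipe, struct model_bdev and choose_splice_from() are hypothetical stand-ins invented for this note, and only S_ISFIFO() is the real POSIX macro. The sketch assumes the same layout idea the patch comment describes, i.e. that the pipe and block-device pointers occupy the same storage.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>	/* S_ISFIFO(), S_IFIFO, S_IFBLK, mode_t */

/* Hypothetical stand-ins for pipe_inode_info and block_device. */
struct model_pipe { int nrbufs; };
struct model_bdev { int major; };

/*
 * Hypothetical, reduced inode: the two pointers share storage, in the
 * way the patch comment says i_pipe/i_bdev/i_cdev do.
 */
struct model_inode {
	mode_t i_mode;
	union {
		struct model_pipe *i_pipe;
		struct model_bdev *i_bdev;
	} u;
};

/* Same logic as pipe_info() in the patch: trust the mode, not the pointer. */
static struct model_pipe *pipe_info(const struct model_inode *inode)
{
	if (S_ISFIFO(inode->i_mode))
		return inode->u.i_pipe;

	return NULL;
}

enum splice_path { SPLICE_PIPE_TO_PIPE, SPLICE_PIPE_TO_FILE };

/* Models only the dispatch decision added to do_splice_from(). */
static enum splice_path choose_splice_from(const struct model_pipe *pipe,
					   const struct model_inode *out)
{
	struct model_pipe *opipe = pipe_info(out);

	if (opipe) {
		assert(opipe != pipe);	/* mirrors BUG_ON(opipe == pipe) */
		return SPLICE_PIPE_TO_PIPE;
	}
	return SPLICE_PIPE_TO_FILE;
}
```

Note that for a block-device inode the shared storage holds a non-NULL pointer, so a bare NULL check on the pipe member would wrongly report a pipe; the mode test avoids that.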
[parent not found: <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2010-03-19 0:59 ` Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan ` (15 subsequent siblings) 16 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger These two are used in the next patch when calling vfs_read/write() Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- fs/read_write.c | 10 ---------- include/linux/fs.h | 10 ++++++++++ 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index b7f4a1f..e258301 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ EXPORT_SYMBOL(vfs_write); -static inline loff_t file_pos_read(struct file *file) -{ - return file->f_pos; -} - -static inline void file_pos_write(struct file *file, loff_t pos) -{ - file->f_pos = pos; -} - SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) { struct file *file; diff --git a/include/linux/fs.h b/include/linux/fs.h index ebb1cd5..6c08df2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, struct iovec *fast_pointer, struct iovec **ret_pointer); +static inline loff_t file_pos_read(struct file *file) +{ + return file->f_pos; +} + +static inline void file_pos_write(struct file *file, loff_t pos) +{ + file->f_pos = pos; +} + extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file 
*, const char __user *, size_t, loff_t *); extern ssize_t vfs_readv(struct file *, const struct iovec __user *, -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
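For readers unfamiliar with the two helpers, the following user-space sketch shows the snapshot/read/write-back pattern they support around vfs_read(). It is only a model under stated assumptions: struct model_file, model_vfs_read() and model_read() are hypothetical names made up for this note, and off_t stands in for loff_t. Only the bodies of file_pos_read() and file_pos_write() follow the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t, off_t */

/* Hypothetical stand-in for struct file: only what the sketch needs. */
struct model_file {
	off_t f_pos;		/* current file position */
	const char *data;	/* backing bytes, for illustration only */
	size_t size;
};

static inline off_t file_pos_read(struct model_file *file)
{
	return file->f_pos;
}

static inline void file_pos_write(struct model_file *file, off_t pos)
{
	file->f_pos = pos;
}

/*
 * Models the vfs_read() contract: read at *pos and advance *pos,
 * without touching file->f_pos.
 */
static ssize_t model_vfs_read(struct model_file *file, char *buf,
			      size_t count, off_t *pos)
{
	size_t left;

	assert(file && buf && pos);
	if (*pos < 0 || (size_t)*pos >= file->size)
		return 0;	/* at or past the end of the data */
	left = file->size - (size_t)*pos;
	if (count > left)
		count = left;
	memcpy(buf, file->data + *pos, count);
	*pos += (off_t)count;
	return (ssize_t)count;
}

/* The pattern itself: snapshot the position, read, write it back. */
static ssize_t model_read(struct model_file *file, char *buf, size_t count)
{
	off_t pos = file_pos_read(file);
	ssize_t ret = model_vfs_read(file, buf, count, &pos);

	file_pos_write(file, pos);
	return ret;
}
```

A caller outside fs/read_write.c that drives vfs_read() or vfs_write() on a struct file needs the same three steps, which is presumably why the helpers move to the header.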
* [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan @ 2010-03-19 0:59 ` Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan ` (14 subsequent siblings) 16 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger While we assume all normal files and directories can be checkpointed, there are, as usual in the VFS, specialized places that will always need an ability to override these defaults. Although we could do this completely in the checkpoint code, that would bitrot quickly. This adds a new 'file_operations' function for checkpointing a file. It is assumed that there should be a dirt-simple way to make something (un)checkpointable that fits in with current code. As you can see in the ext[234] patches down the road, all that we have to do to make something simple be supported is add a single "generic" f_op entry. Also adds a new 'file_operations' function for 'collecting' a file for leak-detection during full-container checkpoint. This is useful for those files that hold references to other "collectable" objects. Two examples are pty files that point to corresponding tty objects, and eventpoll files that refer to the files they are monitoring. Finally, this patch introduces vfs_fcntl() so that it can be called from restart (see patch adding restart of files). Changelog[v17] - Introduce 'collect' method Changelog[v17] - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> Acked-by: Serge E. 
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- fs/fcntl.c | 21 +++++++++++++-------- include/linux/fs.h | 7 +++++++ 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index 97e01dc..e1f02ca 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, return err; } +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) +{ + int err; + + err = security_file_fcntl(filp, cmd, arg); + if (err) + goto out; + err = do_fcntl(fd, cmd, arg, filp); + out: + return err; +} + SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) { struct file *filp; @@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) if (!filp) goto out; - err = security_file_fcntl(filp, cmd, arg); - if (err) { - fput(filp); - return err; - } - - err = do_fcntl(fd, cmd, arg, filp); - + err = vfs_fcntl(fd, cmd, arg, filp); fput(filp); out: return err; diff --git a/include/linux/fs.h b/include/linux/fs.h index 6c08df2..65ebec5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -394,6 +394,7 @@ struct kstatfs; struct vm_area_struct; struct vfsmount; struct cred; +struct ckpt_ctx; extern void __init inode_init(void); extern void __init inode_init_early(void); @@ -1093,6 +1094,8 @@ struct file_lock { #include <linux/fcntl.h> +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp); + extern void send_sigio(struct fown_struct *fown, int fd, int band); #ifdef CONFIG_FILE_LOCKING @@ -1504,6 +1507,8 @@ struct file_operations { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **); + int (*checkpoint)(struct 
ckpt_ctx *, struct file *); + int (*collect)(struct ckpt_ctx *, struct file *); }; struct inode_operations { @@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); +#define generic_file_checkpoint NULL + extern int vfs_readdir(struct file *, filldir_t, void *); extern int vfs_stat(char __user *, struct kstat *); -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
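The opt-in model described in this patch can be sketched in user space as follows. This is an illustration, not the series' code: every model_* name is hypothetical, and the dispatch is modeled on the check that checkpoint_file() performs in the "dump open file descriptors" patch, which returns -EBADF when f_op or f_op->checkpoint is missing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ckpt_ctx. */
struct model_ctx { int files_written; };

struct model_file;

/* Reduced operations table with the new optional method. */
struct model_file_operations {
	int (*read)(struct model_file *);
	int (*checkpoint)(struct model_ctx *, struct model_file *);
};

struct model_file {
	const struct model_file_operations *f_op;
};

/* Plays the role of generic_file_checkpoint() for ordinary files. */
static int model_generic_checkpoint(struct model_ctx *ctx,
				    struct model_file *file)
{
	(void)file;
	ctx->files_written++;
	return 0;
}

/* A filesystem opts in with a single initializer line. */
static const struct model_file_operations regular_fops = {
	.checkpoint = model_generic_checkpoint,
};

/* Members left out of the initializer are NULL: unsupported by default. */
static const struct model_file_operations special_fops = {
	.read = NULL,
};

/* Dispatch: refuse files whose f_op offers no checkpoint method. */
static int model_checkpoint_file(struct model_ctx *ctx,
				 struct model_file *file)
{
	assert(ctx && file);
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(ctx, file);
}
```

The design point is that a file type is uncheckpointable unless its operations table says otherwise, so a new or unusual file type fails the checkpoint loudly instead of being saved incorrectly.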
* [C/R v20][PATCH 38/96] c/r: dump open file descriptors [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan @ 2010-03-19 0:59 ` Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan ` (13 subsequent siblings) 16 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger Dump the file table with 'struct ckpt_hdr_file_table', followed by all open file descriptors. Because the 'struct file' corresponding to an fd can be shared, each one is assigned an objref and registered in the object hash. A reference to the 'file *' is kept for as long as it lives in the hash (the hash is only cleaned up at the end of the checkpoint). Also provide generic_file_checkpoint() and generic_restore_file(), which are good for normal files and directories. They do not yet support unlinked files or directories. 
Changelog[v19]:
- Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
- [Serge Hallyn] Rename fs_mnt to root_fs_path
- [Dave Hansen] Error out on file locks and leases
- [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Add a few more ckpt_write_err()s
- [Dan Smith] Export fill_fname() as ckpt_fill_fname()
- Introduce ckpt_collect_file() that also uses file->collect method
- In collect_file_stabl() use retval from ckpt_obj_collect() to test for first-time-object
Changelog[v17]:
- Only collect sub-objects of files_struct once
- Better file error debugging
- Use (new) d_unlinked()
Changelog[v16]:
- Fix compile warning in checkpoint_bad()
Changelog[v16]:
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- File objects are dumped/restored prior to the first reference
- Introduce a per file-type restore() callback
- Use struct file_operations->checkpoint()
- Put code for generic file descriptors in a separate function
- Use one CKPT_FILE_GENERIC for both regular files and dirs
- Revert change to pr_debug(), back to ckpt_debug()
- Use only unsigned fields in checkpoint headers
- Rename: ckpt_write_files() => checkpoint_fd_table()
- Rename: ckpt_write_fd_data() => checkpoint_file()
- Discard field 'h->parent'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Discard handling of opened symlinks (there is no such thing)
- ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
- Fix a couple of leaks in ckpt_write_files()
- Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
- initialize 'coe' to workaround gcc false warning
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed)
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E.
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- checkpoint/Makefile | 3 +- checkpoint/checkpoint.c | 11 + checkpoint/files.c | 444 ++++++++++++++++++++++++++++++++++++++ checkpoint/objhash.c | 52 +++++ checkpoint/process.c | 33 +++- checkpoint/sys.c | 8 + fs/locks.c | 35 +++ include/linux/checkpoint.h | 19 ++ include/linux/checkpoint_hdr.h | 59 +++++ include/linux/checkpoint_types.h | 5 + include/linux/fs.h | 10 + 11 files changed, 677 insertions(+), 2 deletions(-) create mode 100644 checkpoint/files.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 5aa6a75..1d0c058 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \ objhash.o \ checkpoint.o \ restart.o \ - process.o + process.o \ + files.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index c016a2d..2bc2495 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -18,6 +18,7 @@ #include <linux/time.h> #include <linux/fs.h> #include <linux/file.h> +#include <linux/fs_struct.h> #include <linux/dcache.h> #include <linux/mount.h> #include <linux/utsname.h> @@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) { struct task_struct *task; struct nsproxy *nsproxy; + struct fs_struct *fs; /* * No need for explicit cleanup here, because if an error @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) return -EINVAL; /* cleanup by ckpt_ctx_free() */ } + /* root vfs (FIX: WILL CHANGE with mnt-ns etc */ + task_lock(ctx->root_task); + fs = ctx->root_task->fs; + read_lock(&fs->lock); + ctx->root_fs_path = fs->root; + path_get(&ctx->root_fs_path); + read_unlock(&fs->lock); + task_unlock(ctx->root_task); + return 0; } diff --git a/checkpoint/files.c b/checkpoint/files.c new file mode 100644 index 0000000..7a57b24 --- /dev/null +++ b/checkpoint/files.c @@ -0,0 +1,444 @@ +/* + * Checkpoint 
file descriptors + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DFILE + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/deferqueue.h> +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + + +/************************************************************************** + * Checkpoint + */ + +/** + * ckpt_fill_fname - return pathname of a given file + * @path: path name + * @root: relative root + * @buf: buffer for pathname + * @len: buffer length (in) and pathname length (out) + */ +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len) +{ + struct path tmp = *root; + char *fname; + + BUG_ON(!buf); + spin_lock(&dcache_lock); + fname = __d_path(path, &tmp, buf, *len); + spin_unlock(&dcache_lock); + if (IS_ERR(fname)) + return fname; + *len = (buf + (*len) - fname); + /* + * FIX: if __d_path() changed these, it must have stepped out of + * init's namespace. Since currently we require a unified namespace + * within the container: simply fail. 
+ */ + if (tmp.mnt != root->mnt || tmp.dentry != root->dentry) + fname = ERR_PTR(-EBADF); + + return fname; +} + +/** + * checkpoint_fname - write a file name + * @ctx: checkpoint context + * @path: path name + * @root: relative root + */ +int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root) +{ + char *buf, *fname; + int ret, flen; + + /* + * FIXME: we can optimize and save memory (and storage) if we + * share strings (through objhash) and reference them instead + */ + + flen = PATH_MAX; + buf = kmalloc(flen, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + fname = ckpt_fill_fname(path, root, buf, &flen); + if (!IS_ERR(fname)) { + ret = ckpt_write_obj_type(ctx, fname, flen, + CKPT_HDR_FILE_NAME); + } else { + ret = PTR_ERR(fname); + ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n", + path->dentry->d_name.name); + } + + kfree(buf); + return ret; +} + +#define CKPT_DEFAULT_FDTABLE 256 /* an initial guess */ + +/** + * scan_fds - scan file table and construct array of open fds + * @files: files_struct pointer + * @fdtable: (output) array of open fds + * + * Returns the number of open fds found, and also the file table + * array via *fdtable. The caller should free the array. + * + * The caller must validate the file descriptors collected in the + * array before using them, e.g. by using fcheck_files(), in case + * the task's fdtable changes in the meantime. + */ +static int scan_fds(struct files_struct *files, int **fdtable) +{ + struct fdtable *fdt; + int *fds = NULL; + int i = 0, n = 0; + int tot = CKPT_DEFAULT_FDTABLE; + + /* + * We assume that all tasks possibly sharing the file table are + * frozen (or we are a single process and we checkpoint ourselves). + * Therefore, we can safely proceed after krealloc() from where we + * left off. Otherwise the file table may be modified by another + * task after we scan it. The behavior in this case is undefined, + * and either checkpoint or restart will likely fail. 
+ */ + retry: + fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL); + if (!fds) + return -ENOMEM; + + rcu_read_lock(); + fdt = files_fdtable(files); + for (/**/; i < fdt->max_fds; i++) { + if (!fcheck_files(files, i)) + continue; + if (n == tot) { + rcu_read_unlock(); + tot *= 2; /* won't overflow: kmalloc will fail */ + goto retry; + } + fds[n++] = i; + } + rcu_read_unlock(); + + *fdtable = fds; + return n; +} + +int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h) +{ + h->f_flags = file->f_flags; + h->f_mode = file->f_mode; + h->f_pos = file->f_pos; + h->f_version = file->f_version; + + ckpt_debug("file %s credref %d", file->f_dentry->d_name.name, + h->f_credref); + + /* FIX: need also file->uid, file->gid, file->f_owner, etc */ + + return 0; +} + +int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) +{ + struct ckpt_hdr_file_generic *h; + int ret; + + /* + * FIXME: when we'll add support for unlinked files/dirs, we'll + * need to distinguish between unlinked files and unlinked dirs. 
+ */ + if (d_unlinked(file->f_dentry)) { + ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n", + file); + return -EBADF; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE); + if (!h) + return -ENOMEM; + + h->common.f_type = CKPT_FILE_GENERIC; + + ret = checkpoint_file_common(ctx, file, &h->common); + if (ret < 0) + goto out; + ret = ckpt_write_obj(ctx, &h->common.h); + if (ret < 0) + goto out; + ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path); + out: + ckpt_hdr_put(ctx, h); + return ret; +} +EXPORT_SYMBOL(generic_file_checkpoint); + +/* checkpoint callback for file pointer */ +int checkpoint_file(struct ckpt_ctx *ctx, void *ptr) +{ + struct file *file = (struct file *) ptr; + int ret; + + if (!file->f_op || !file->f_op->checkpoint) { + ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n", + file, file->f_op); + return -EBADF; + } + + ret = file->f_op->checkpoint(ctx, file); + if (ret < 0) + ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file); + return ret; +} + +/** + * ckpt_write_file_desc - dump the state of a given file descriptor + * @ctx: checkpoint context + * @files: files_struct pointer + * @fd: file descriptor + * + * Saves the state of the file descriptor; looks up the actual file + * pointer in the hash table, and if found saves the matching objref, + * otherwise calls ckpt_write_file to dump the file pointer too. 
+ */ +static int checkpoint_file_desc(struct ckpt_ctx *ctx, + struct files_struct *files, int fd) +{ + struct ckpt_hdr_file_desc *h; + struct file *file = NULL; + struct fdtable *fdt; + int objref, ret; + int coe = 0; /* avoid gcc warning */ + pid_t pid; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC); + if (!h) + return -ENOMEM; + + rcu_read_lock(); + fdt = files_fdtable(files); + file = fcheck_files(files, fd); + if (file) { + coe = FD_ISSET(fd, fdt->close_on_exec); + get_file(file); + } + rcu_read_unlock(); + + /* sanity check (although this shouldn't happen) */ + ret = -EBADF; + if (!file) { + ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd); + goto out; + } + + ret = find_locks_with_owner(file, files); + /* + * find_locks_with_owner() returns an error when there + * are no locks found, so we *want* it to return an error + * code. Its success means we have to fail the checkpoint. + */ + if (!ret) { + ret = -EBADF; + ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd); + goto out; + } + + /* + * TODO: Implement c/r of fowner and f_sigio. Should be + * trivial, but for now we just refuse its checkpoint + */ + pid = f_getown(file); + if (pid) { + ret = -EBUSY; + ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid); + goto out; + } + + /* + * if seen first time, this will add 'file' to the objhash, keep + * a reference to it, dump its state while at it. 
+ */ + objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE); + ckpt_debug("fd %d objref %d file %p coe %d)\n", fd, objref, file, coe); + if (objref < 0) { + ret = objref; + goto out; + } + + h->fd_objref = objref; + h->fd_descriptor = fd; + h->fd_close_on_exec = coe; + + ret = ckpt_write_obj(ctx, &h->h); +out: + ckpt_hdr_put(ctx, h); + if (file) + fput(file); + return ret; +} + +static int do_checkpoint_file_table(struct ckpt_ctx *ctx, + struct files_struct *files) +{ + struct ckpt_hdr_file_table *h; + int *fdtable = NULL; + int nfds, n, ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE); + if (!h) + return -ENOMEM; + + nfds = scan_fds(files, &fdtable); + if (nfds < 0) { + ret = nfds; + goto out; + } + + h->fdt_nfds = nfds; + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret < 0) + goto out; + + ckpt_debug("nfds %d\n", nfds); + for (n = 0; n < nfds; n++) { + ret = checkpoint_file_desc(ctx, files, fdtable[n]); + if (ret < 0) + goto out; + } + + ret = deferqueue_run(ctx->files_deferq); + ckpt_debug("files_deferq ran %d entries\n", ret); + if (ret > 0) + ret = 0; + out: + kfree(fdtable); + return ret; +} + +/* checkpoint callback for file table */ +int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr) +{ + return do_checkpoint_file_table(ctx, (struct files_struct *) ptr); +} + +/* checkpoint wrapper for file table */ +int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct files_struct *files; + int objref; + + files = get_files_struct(t); + if (!files) + return -EBUSY; + objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE); + put_files_struct(files); + + return objref; +} + +/*********************************************************************** + * Collect + */ + +int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file) +{ + int ret; + + ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE); + if (ret <= 0) + return ret; + /* if first time for this file (ret > 0), invoke ->collect() */ + if 
(file->f_op->collect) + ret = file->f_op->collect(ctx, file); + if (ret < 0) + ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file); + return ret; +} + +static int collect_file_desc(struct ckpt_ctx *ctx, + struct files_struct *files, int fd) +{ + struct fdtable *fdt; + struct file *file; + int ret; + + rcu_read_lock(); + fdt = files_fdtable(files); + file = fcheck_files(files, fd); + if (file) + get_file(file); + rcu_read_unlock(); + + if (!file) { + ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file); + return -EBUSY; + } + + ret = ckpt_collect_file(ctx, file); + fput(file); + + return ret; +} + +static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files) +{ + int *fdtable; + int nfds, n; + int ret; + + /* if already exists (ret == 0), nothing to do */ + ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE); + if (ret <= 0) + return ret; + + /* if first time for this file table (ret > 0), proceed inside */ + nfds = scan_fds(files, &fdtable); + if (nfds < 0) + return nfds; + + for (n = 0; n < nfds; n++) { + ret = collect_file_desc(ctx, files, fdtable[n]); + if (ret < 0) + break; + } + + kfree(fdtable); + return ret; +} + +int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct files_struct *files; + int ret; + + files = get_files_struct(t); + if (!files) { + ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n"); + return -EBUSY; + } + ret = collect_file_table(ctx, files); + put_files_struct(files); + + return ret; +} diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index 22b1601..f25d130 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -13,6 +13,8 @@ #include <linux/kernel.h> #include <linux/hash.h> +#include <linux/file.h> +#include <linux/fdtable.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> @@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr) return 0; } +static int obj_file_table_grab(void *ptr) +{ + atomic_inc(&((struct files_struct *) ptr)->count); + return 0; +} + 
+static void obj_file_table_drop(void *ptr, int lastref) +{ + put_files_struct((struct files_struct *) ptr); +} + +static int obj_file_table_users(void *ptr) +{ + return atomic_read(&((struct files_struct *) ptr)->count); +} + +static int obj_file_grab(void *ptr) +{ + get_file((struct file *) ptr); + return 0; +} + +static void obj_file_drop(void *ptr, int lastref) +{ + fput((struct file *) ptr); +} + +static int obj_file_users(void *ptr) +{ + return atomic_long_read(&((struct file *) ptr)->f_count); +} + static struct ckpt_obj_ops ckpt_obj_ops[] = { /* ignored object */ { @@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_drop = obj_no_drop, .ref_grab = obj_no_grab, }, + /* files_struct object */ + { + .obj_name = "FILE_TABLE", + .obj_type = CKPT_OBJ_FILE_TABLE, + .ref_drop = obj_file_table_drop, + .ref_grab = obj_file_table_grab, + .ref_users = obj_file_table_users, + .checkpoint = checkpoint_file_table, + }, + /* file object */ + { + .obj_name = "FILE", + .obj_type = CKPT_OBJ_FILE, + .ref_drop = obj_file_drop, + .ref_grab = obj_file_grab, + .ref_users = obj_file_users, + .checkpoint = checkpoint_file, + }, }; diff --git a/checkpoint/process.c b/checkpoint/process.c index ef394a5..adc34a2 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN); } +static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_task_objs *h; + int files_objref; + int ret; + + files_objref = checkpoint_obj_file_table(ctx, t); + ckpt_debug("files: objref %d\n", files_objref); + if (files_objref < 0) { + ckpt_err(ctx, files_objref, "%(T)files_struct\n"); + return files_objref; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS); + if (!h) + return -ENOMEM; + h->files_objref = files_objref; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + return 
ret; +} + /* dump the task_struct of a given task */ int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t) { @@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) goto out; ret = checkpoint_cpu(ctx, t); ckpt_debug("cpu %d\n", ret); + if (ret < 0) + goto out; + ret = checkpoint_task_objs(ctx, t); + ckpt_debug("objs %d\n", ret); out: ctx->tsk = NULL; return ret; @@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t) { - return 0; + int ret; + + ret = ckpt_collect_file_table(ctx, t); + + return ret; } /*********************************************************************** diff --git a/checkpoint/sys.c b/checkpoint/sys.c index 926c937..30b8004 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx) if (ctx->kflags & CKPT_CTX_RESTART) restore_debug_free(ctx); + if (ctx->files_deferq) + deferqueue_destroy(ctx->files_deferq); + if (ctx->file) fput(ctx->file); if (ctx->logfile) fput(ctx->logfile); ckpt_obj_hash_free(ctx); + path_put(&ctx->root_fs_path); if (ctx->tasks_arr) task_arr_free(ctx); @@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, if (ckpt_obj_hash_alloc(ctx) < 0) goto err; + ctx->files_deferq = deferqueue_create(); + if (!ctx->files_deferq) + goto err; + atomic_inc(&ctx->refcount); return ctx; err: diff --git a/fs/locks.c b/fs/locks.c index a8794f2..721481a 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner) EXPORT_SYMBOL(locks_remove_posix); +int find_locks_with_owner(struct file *filp, fl_owner_t owner) +{ + struct inode *inode = filp->f_path.dentry->d_inode; + struct file_lock **inode_fl; + int ret = -EEXIST; + + lock_kernel(); + for_each_lock(inode, inode_fl) { + struct file_lock *fl = *inode_fl; + /* + * We could use 
posix_same_owner() along with a 'fake' + * file_lock. But, the fake file will never have the + * same fl_lmops as the fl that we are looking for and + * posix_same_owner() would just fall back to this + * check anyway. + */ + if (IS_POSIX(fl)) { + if (fl->fl_owner == owner) { + ret = 0; + break; + } + } else if (IS_FLOCK(fl) || IS_LEASE(fl)) { + if (fl->fl_file == filp) { + ret = 0; + break; + } + } else { + WARN(1, "unknown file lock type, fl_flags: %x", + fl->fl_flags); + } + } + unlock_kernel(); + return ret; +} + /* * This function is called on the last close of an open file. */ diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 50ce8f9..d74a890 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx, extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max); extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type); +extern char *ckpt_fill_fname(struct path *path, struct path *root, + char *buf, int *len); + /* ckpt kflags */ #define ckpt_set_ctx_kflag(__ctx, __kflag) \ set_bit(__kflag##_BIT, &(__ctx)->kflags) @@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t); extern int restore_restart_block(struct ckpt_ctx *ctx); +/* file table */ +extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t); +extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx, + struct task_struct *t); +extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr); + +/* files */ +extern int checkpoint_fname(struct ckpt_ctx *ctx, + struct path *path, struct path *root); +extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file); +extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr); + +extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h); + static inline int ckpt_validate_errno(int errno) { return (errno >= 0) && (errno < 
MAX_ERRNO); @@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno) #define CKPT_DSYS 0x2 /* generic (system) */ #define CKPT_DRW 0x4 /* image read/write */ #define CKPT_DOBJ 0x8 /* shared objects */ +#define CKPT_DFILE 0x10 /* files and filesystem */ #define CKPT_DDEFAULT 0xffff /* default debug level */ diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index cdca9e4..3222545 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -71,6 +71,8 @@ enum { #define CKPT_HDR_TREE CKPT_HDR_TREE CKPT_HDR_TASK, #define CKPT_HDR_TASK CKPT_HDR_TASK + CKPT_HDR_TASK_OBJS, +#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS CKPT_HDR_RESTART_BLOCK, #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK CKPT_HDR_THREAD, @@ -80,6 +82,15 @@ enum { /* 201-299: reserved for arch-dependent */ + CKPT_HDR_FILE_TABLE = 301, +#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE + CKPT_HDR_FILE_DESC, +#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC + CKPT_HDR_FILE_NAME, +#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME + CKPT_HDR_FILE, +#define CKPT_HDR_FILE CKPT_HDR_FILE + CKPT_HDR_TAIL = 9001, #define CKPT_HDR_TAIL CKPT_HDR_TAIL @@ -106,6 +117,10 @@ struct ckpt_hdr_objref { enum obj_type { CKPT_OBJ_IGNORE = 0, #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE + CKPT_OBJ_FILE_TABLE, +#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE + CKPT_OBJ_FILE, +#define CKPT_OBJ_FILE CKPT_OBJ_FILE CKPT_OBJ_MAX #define CKPT_OBJ_MAX CKPT_OBJ_MAX }; @@ -188,6 +203,12 @@ struct ckpt_hdr_task { __u64 clear_child_tid; } __attribute__((aligned(8))); +/* task's shared resources */ +struct ckpt_hdr_task_objs { + struct ckpt_hdr h; + __s32 files_objref; +} __attribute__((aligned(8))); + /* restart blocks */ struct ckpt_hdr_restart_block { struct ckpt_hdr h; @@ -220,4 +241,42 @@ enum restart_block_type { #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX }; +/* file system */ +struct ckpt_hdr_file_table { + struct ckpt_hdr h; + __s32 fdt_nfds; +} 
__attribute__((aligned(8))); + +/* file descriptors */ +struct ckpt_hdr_file_desc { + struct ckpt_hdr h; + __s32 fd_objref; + __s32 fd_descriptor; + __u32 fd_close_on_exec; +} __attribute__((aligned(8))); + +enum file_type { + CKPT_FILE_IGNORE = 0, +#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE + CKPT_FILE_GENERIC, +#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC + CKPT_FILE_MAX +#define CKPT_FILE_MAX CKPT_FILE_MAX +}; + +/* file objects */ +struct ckpt_hdr_file { + struct ckpt_hdr h; + __u32 f_type; + __u32 f_mode; + __u32 f_flags; + __u32 _padding; + __u64 f_pos; + __u64 f_version; +} __attribute__((aligned(8))); + +struct ckpt_hdr_file_generic { + struct ckpt_hdr_file common; +} __attribute__((aligned(8))); + #endif /* _CHECKPOINT_CKPT_HDR_H_ */ diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h index 90bbb16..aae6755 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -14,6 +14,8 @@ #include <linux/sched.h> #include <linux/nsproxy.h> +#include <linux/list.h> +#include <linux/path.h> #include <linux/fs.h> #include <linux/ktime.h> #include <linux/wait.h> @@ -40,6 +42,9 @@ struct ckpt_ctx { atomic_t refcount; struct ckpt_obj_hash *obj_hash; /* repository for shared objects */ + struct deferqueue_head *files_deferq; /* deferred file-table work */ + + struct path root_fs_path; /* container root (FIXME) */ struct task_struct *tsk;/* checkpoint: current target task */ char err_string[256]; /* checkpoint: error string */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 65ebec5..7902a51 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t); extern void locks_remove_flock(struct file *); extern void locks_release_private(struct file_lock *); extern void posix_test_lock(struct file *, struct file_lock *); +extern int find_locks_with_owner(struct file *filp, fl_owner_t owner); extern int posix_lock_file(struct file *, struct 
file_lock *, struct file_lock *); extern int posix_lock_file_wait(struct file *, struct file_lock *); extern int posix_unblock_lock(struct file *, struct file_lock *); @@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner) return; } +static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner) +{ + return -ENOENT; +} + static inline void locks_remove_flock(struct file *filp) { return; @@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); +#ifdef CONFIG_CHECKPOINT +extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file); +#else #define generic_file_checkpoint NULL +#endif extern int vfs_readdir(struct file *, filldir_t, void *); -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
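The scan_fds() helper in the patch above grows its output array when it fills up and then resumes the scan at the index where it stopped, instead of starting over. The following is a minimal userspace sketch of that grow-and-resume pattern, assuming a plain byte array in place of the kernel fdtable. The names scan_slots() and INITIAL_GUESS and the table layout are illustrative only and are not part of the patch. The sketch routes realloc() through a temporary pointer so the old buffer can be freed if the allocation fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define INITIAL_GUESS 4	/* hypothetical; kept small so the retry path runs */

/*
 * Collect the indices of all non-zero entries in table[0..max).
 * On success, returns the number of indices found and stores a
 * malloc'ed array in *out (the caller frees it).
 * On allocation failure, returns -ENOMEM and leaves *out untouched.
 */
static int scan_slots(const unsigned char *table, int max, int **out)
{
	int *slots = NULL, *tmp;
	int i = 0, n = 0;
	size_t tot = INITIAL_GUESS;

retry:
	tmp = realloc(slots, tot * sizeof(*slots));
	if (!tmp) {
		free(slots);	/* realloc() leaves the old block allocated */
		return -ENOMEM;
	}
	slots = tmp;

	/* no loop initializer: i and n survive a jump back to 'retry' */
	for (; i < max; i++) {
		if (!table[i])
			continue;
		if ((size_t)n == tot) {
			tot *= 2;
			goto retry;
		}
		slots[n++] = i;
	}

	*out = slots;
	return n;
}
```

As in the patch, resuming from the saved index is only sound if the table cannot change between passes, which is why the kernel code relies on the tasks sharing the file table being frozen.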
* [C/R v20][PATCH 39/96] c/r: restore open file descriptors [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> ` (2 preceding siblings ...) 2010-03-19 0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan @ 2010-03-19 0:59 ` Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan ` (12 subsequent siblings) 16 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger For each fd, read 'struct ckpt_hdr_file_desc' and look up its objref in the hash table. If it is not found in the hash table (first occurrence), read in 'struct ckpt_hdr_file', create a new file and register it in the hash. Otherwise attach the file pointer from the hash as an fd. Changelog[v19-rc1]: - Fix lockdep complaint in restore_obj_files() Changelog[v19-rc1]: - Restore thread/cpu state early - Ensure null-termination of file names read from image - Fix compile warning in restore_open_fname() Changelog[v18]: - Invoke set_close_on_exec() unconditionally on restart Changelog[v17]: - Validate f_mode after restore against saved f_mode - Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC - Reorder patch (move earlier in series) - Handle shared files_struct objects Changelog[v14]: - Introduce a per file-type restore() callback - Revert change to pr_debug(), back to ckpt_debug() - Rename: restore_files() => restore_fd_table() - Rename: ckpt_read_fd_data() => restore_file() - Check whether calls to ckpt_hbuf_get() fail - Discard field 'hh->parent' Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Changelog[v6]: - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed) Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> Acked-by: Serge E. 
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- checkpoint/files.c | 318 ++++++++++++++++++++++++++++++++++++++++++++ checkpoint/objhash.c | 2 + checkpoint/process.c | 20 +++ include/linux/checkpoint.h | 7 + 4 files changed, 347 insertions(+), 0 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index 7a57b24..b404c8f 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -16,6 +16,8 @@ #include <linux/sched.h> #include <linux/file.h> #include <linux/fdtable.h> +#include <linux/fsnotify.h> +#include <linux/syscalls.h> #include <linux/deferqueue.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> @@ -442,3 +444,319 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } + +/************************************************************************** + * Restart + */ + +/** + * restore_open_fname - read a file name and open a file + * @ctx: checkpoint context + * @flags: file flags + */ +struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags) +{ + struct file *file; + char *fname; + int len; + + /* prevent bad input from doing bad things */ + if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC)) + return ERR_PTR(-EINVAL); + + len = ckpt_read_payload(ctx, (void **) &fname, + PATH_MAX, CKPT_HDR_FILE_NAME); + if (len < 0) + return ERR_PTR(len); + fname[len - 1] = '\0'; /* always play it safe */ + ckpt_debug("fname '%s' flags %#x\n", fname, flags); + + file = filp_open(fname, flags, 0); + kfree(fname); + + return file; +} + +static int close_all_fds(struct files_struct *files) +{ + int *fdtable; + int nfds; + + nfds = scan_fds(files, &fdtable); + if (nfds < 0) + return nfds; + while (nfds--) + sys_close(fdtable[nfds]); + kfree(fdtable); + return 0; +} + +/** + * attach_file - attach a lonely file ptr to a file descriptor + * @file: lonely file pointer + */ +static int attach_file(struct file *file) +{ + int fd = 
get_unused_fd_flags(0); + + if (fd >= 0) { + get_file(file); + fsnotify_open(file->f_path.dentry); + fd_install(fd, file); + } + return fd; +} + +#define CKPT_SETFL_MASK \ + (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME) + +int restore_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h) +{ + fmode_t new_mode = file->f_mode; + fmode_t saved_mode = (__force fmode_t) h->f_mode; + int ret; + + /* FIX: need to restore uid, gid, owner etc */ + + /* safe to set 1st arg (fd) to 0, as command is F_SETFL */ + ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file); + if (ret < 0) + return ret; + + /* + * Normally f_mode is set by open, and modified only via + * fcntl(), so its value now should match that at checkpoint. + * However, a file may be downgraded from (read-)write to + * read-only, e.g.: + * - mark_files_ro() unsets FMODE_WRITE + * - nfs4_file_downgrade() too, and also sets FMODE_READ + * Validate the new f_mode against saved f_mode, allowing: + * - new with FMODE_WRITE, saved without FMODE_WRITE + * - new without FMODE_READ, saved with FMODE_READ + */ + if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) { + new_mode &= ~FMODE_WRITE; + if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ)) + new_mode |= FMODE_READ; + } + /* finally, at this point new mode should match saved mode */ + if (new_mode ^ saved_mode) + return -EINVAL; + + if (file->f_mode & FMODE_LSEEK) + ret = vfs_llseek(file, h->f_pos, SEEK_SET); + + return ret; +} + +static struct file *generic_file_restore(struct ckpt_ctx *ctx, + struct ckpt_hdr_file *ptr) +{ + struct file *file; + int ret; + + if (ptr->h.type != CKPT_HDR_FILE || + ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC) + return ERR_PTR(-EINVAL); + + file = restore_open_fname(ctx, ptr->f_flags); + if (IS_ERR(file)) + return file; + + ret = restore_file_common(ctx, file, ptr); + if (ret < 0) { + fput(file); + file = ERR_PTR(ret); + } + return file; +} + 
+struct restore_file_ops { + char *file_name; + enum file_type file_type; + struct file * (*restore) (struct ckpt_ctx *ctx, + struct ckpt_hdr_file *ptr); +}; + +static struct restore_file_ops restore_file_ops[] = { + /* ignored file */ + { + .file_name = "IGNORE", + .file_type = CKPT_FILE_IGNORE, + .restore = NULL, + }, + /* regular file/directory */ + { + .file_name = "GENERIC", + .file_type = CKPT_FILE_GENERIC, + .restore = generic_file_restore, + }, +}; + +static struct file *do_restore_file(struct ckpt_ctx *ctx) +{ + struct restore_file_ops *ops; + struct ckpt_hdr_file *h; + struct file *file = ERR_PTR(-EINVAL); + + /* + * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file, + * but the actual object depends on the file type. The length + * should never be more than page. + */ + h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE); + if (IS_ERR(h)) + return (struct file *) h; + ckpt_debug("flags %#x mode %#x type %d\n", + h->f_flags, h->f_mode, h->f_type); + + if (h->f_type >= CKPT_FILE_MAX) + goto out; + + ops = &restore_file_ops[h->f_type]; + BUG_ON(ops->file_type != h->f_type); + + if (ops->restore) + file = ops->restore(ctx, h); + out: + ckpt_hdr_put(ctx, h); + return file; +} + +/* restore callback for file pointer */ +void *restore_file(struct ckpt_ctx *ctx) +{ + return (void *) do_restore_file(ctx); +} + +/** + * ckpt_read_file_desc - restore the state of a given file descriptor + * @ctx: checkpoint context + * + * Restores the state of a file descriptor; looks up the objref (in the + * header) in the hash table, and if found picks the matching file and + * use it; otherwise calls restore_file to restore the file too. 
+ */ +static int restore_file_desc(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_file_desc *h; + struct file *file; + int newfd, ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC); + if (IS_ERR(h)) + return PTR_ERR(h); + ckpt_debug("ref %d fd %d c.o.e %d\n", + h->fd_objref, h->fd_descriptor, h->fd_close_on_exec); + + ret = -EINVAL; + if (h->fd_objref <= 0 || h->fd_descriptor < 0) + goto out; + + file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE); + if (IS_ERR(file)) { + ret = PTR_ERR(file); + goto out; + } + + newfd = attach_file(file); + if (newfd < 0) { + ret = newfd; + goto out; + } + + ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor); + + /* reposition if newfd isn't desired fd */ + if (newfd != h->fd_descriptor) { + ret = sys_dup2(newfd, h->fd_descriptor); + if (ret < 0) + goto out; + sys_close(newfd); + } + + set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec); + ret = 0; + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +/* restore callback for file table */ +static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_file_table *h; + struct files_struct *files; + int i, ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE); + if (IS_ERR(h)) + return (struct files_struct *) h; + + ckpt_debug("nfds %d\n", h->fdt_nfds); + + ret = -EMFILE; + if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open) + goto out; + + /* + * We assume that restarting tasks, as created in user-space, + * have distinct files_struct objects each. If not, we need to + * call dup_fd() to make sure we don't overwrite an already + * restored one. 
+ */ + + /* point of no return -- close all file descriptors */ + ret = close_all_fds(current->files); + if (ret < 0) + goto out; + + for (i = 0; i < h->fdt_nfds; i++) { + ret = restore_file_desc(ctx); + if (ret < 0) + goto out; + } + + ret = deferqueue_run(ctx->files_deferq); + ckpt_debug("files_deferq ran %d entries\n", ret); + if (ret > 0) + ret = 0; + out: + ckpt_hdr_put(ctx, h); + if (!ret) { + files = current->files; + atomic_inc(&files->count); + } else { + files = ERR_PTR(ret); + } + return files; +} + +void *restore_file_table(struct ckpt_ctx *ctx) +{ + return (void *) do_restore_file_table(ctx); +} + +int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref) +{ + struct files_struct *files; + + files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE); + if (IS_ERR(files)) + return PTR_ERR(files); + + if (files != current->files) { + struct files_struct *prev; + + task_lock(current); + prev = current->files; + current->files = files; + atomic_inc(&files->count); + task_unlock(current); + + put_files_struct(prev); + } + + return 0; +} diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index f25d130..cacc4c7 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -112,6 +112,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_grab = obj_file_table_grab, .ref_users = obj_file_table_users, .checkpoint = checkpoint_file_table, + .restore = restore_file_table, }, /* file object */ { @@ -121,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_grab = obj_file_grab, .ref_users = obj_file_users, .checkpoint = checkpoint_file, + .restore = restore_file, }, }; diff --git a/checkpoint/process.c b/checkpoint/process.c index adc34a2..23e0296 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -348,6 +348,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx) return ret; } +static int restore_task_objs(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_task_objs *h; + int ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), 
CKPT_HDR_TASK_OBJS); + if (IS_ERR(h)) + return PTR_ERR(h); + + ret = restore_obj_file_table(ctx, h->files_objref); + ckpt_debug("file_table: ret %d (%p)\n", ret, current->files); + + ckpt_hdr_put(ctx, h); + return ret; +} + int restore_restart_block(struct ckpt_ctx *ctx) { struct ckpt_hdr_restart_block *h; @@ -477,6 +493,10 @@ int restore_task(struct ckpt_ctx *ctx) goto out; ret = restore_cpu(ctx); ckpt_debug("cpu %d\n", ret); + if (ret < 0) + goto out; + ret = restore_task_objs(ctx); + ckpt_debug("objs %d\n", ret); out: return ret; } diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index d74a890..749f30c 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -163,16 +163,23 @@ extern int restore_restart_block(struct ckpt_ctx *ctx); extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t); extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t); +extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref); extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr); +extern void *restore_file_table(struct ckpt_ctx *ctx); /* files */ extern int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root); +extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags); + extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file); extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr); +extern void *restore_file(struct ckpt_ctx *ctx); extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, struct ckpt_hdr_file *h); +extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h); static inline int ckpt_validate_errno(int errno) { -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
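restore_file_common() in the patch above accepts a reopened file whose f_mode differs from the saved f_mode only in the one direction its comment describes: the new file has FMODE_WRITE while the saved mode does not. As a rough illustration, the same check can be exercised in userspace as sketched below. DEMO_FMODE_READ and DEMO_FMODE_WRITE are local stand-ins for the kernel's FMODE_* bits and demo_check_mode() is a hypothetical name. The branch logic itself is copied from the patch.

```c
#include <assert.h>
#include <errno.h>

/* local stand-ins for the kernel's FMODE_READ / FMODE_WRITE bits */
#define DEMO_FMODE_READ		0x1u
#define DEMO_FMODE_WRITE	0x2u

/*
 * Return 0 if new_mode is acceptable given saved_mode, -EINVAL otherwise.
 * If the reopened file is writable but the saved mode was not, drop the
 * write bit from new_mode (and add the read bit if the saved mode had it)
 * before comparing. Any other mismatch is rejected.
 */
static int demo_check_mode(unsigned int new_mode, unsigned int saved_mode)
{
	if ((new_mode & DEMO_FMODE_WRITE) && !(saved_mode & DEMO_FMODE_WRITE)) {
		new_mode &= ~DEMO_FMODE_WRITE;
		if (!(new_mode & DEMO_FMODE_READ) &&
		    (saved_mode & DEMO_FMODE_READ))
			new_mode |= DEMO_FMODE_READ;
	}

	/* after the adjustment the two modes must be identical */
	return (new_mode ^ saved_mode) ? -EINVAL : 0;
}
```

For example, a file reopened read-write against a saved read-only mode passes, while a file reopened read-only against a saved read-write mode is rejected with -EINVAL.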
* [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> ` (3 preceding siblings ...) 2010-03-19 0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan @ 2010-03-19 0:59 ` Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan ` (11 subsequent siblings) 16 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger Changelog[v17] - Forward-declare 'ckpt_ctx et-al, don't use checkpoint_types.h Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- include/linux/mm.h | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 60c467b..48d67ee 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -19,6 +19,7 @@ struct file_ra_state; struct user_struct; struct writeback_control; struct rlimit; +struct ckpt_ctx; #ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */ extern unsigned long max_mapnr; @@ -220,6 +221,9 @@ struct vm_operations_struct { int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, const nodemask_t *to, unsigned long flags); #endif +#ifdef CONFIG_CHECKPOINT + int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma); +#endif }; struct mmu_gather; -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
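The patch above only adds an optional '->checkpoint()' slot to struct vm_operations_struct; the code that calls it is not part of this message. A minimal userspace sketch of how such an optional method might be dispatched follows, modeled on checkpoint_file() from patch 38, which fails with -EBADF when f_op lacks a checkpoint method. All demo_* names are hypothetical, and reusing -EBADF for the vma case is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* hypothetical stand-ins for ckpt_ctx, vm_area_struct, vm_operations_struct */
struct demo_ctx {
	int written;	/* counts objects "written" to the image */
};

struct demo_vma;

struct demo_vm_ops {
	/* optional: NULL means this mapping type cannot be checkpointed */
	int (*checkpoint)(struct demo_ctx *ctx, struct demo_vma *vma);
};

struct demo_vma {
	const struct demo_vm_ops *vm_ops;
};

/* a trivial method that only records that it was invoked */
static int demo_generic_checkpoint(struct demo_ctx *ctx, struct demo_vma *vma)
{
	(void)vma;
	ctx->written++;
	return 0;
}

/* refuse mappings without a method, otherwise delegate to it */
static int demo_checkpoint_vma(struct demo_ctx *ctx, struct demo_vma *vma)
{
	if (!vma->vm_ops || !vma->vm_ops->checkpoint)
		return -EBADF;	/* assumption: same error as checkpoint_file() */
	return vma->vm_ops->checkpoint(ctx, vma);
}
```

Leaving the slot NULL by default means a mapping type stays non-checkpointable until it explicitly opts in, which matches the per-filesystem opt-in used for the f_op method later in the series.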
* [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (4 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
` (10 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger, Dave Hansen

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This marks ext[234] as being checkpointable. There will be many
more to do this to, but this is a start.

Changelog[ckpt-v19-rc3]:
  - Rebase to kernel 2.6.33 (ext2)
Changelog[v1]:
  - [Serge Hallyn] Use filemap_checkpoint() in ext4_file_vm_ops

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/ext2/dir.c | 1 +
 fs/ext2/file.c | 2 ++
 fs/ext3/dir.c | 1 +
 fs/ext3/file.c | 1 +
 fs/ext4/dir.c | 1 +
 fs/ext4/file.c | 4 ++++
 6 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..84c17f9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl = ext2_compat_ioctl,
 #endif
 	.fsync = ext2_fsync,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 586e358..b38d7b9 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -75,6 +75,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync = ext2_fsync,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
+	.checkpoint = generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -90,6 +91,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open = generic_file_open,
 	.release = ext2_release_file,
 	.fsync = ext2_fsync,
+	.checkpoint = generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 373fa90..65f98af 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync = ext3_sync_file, /* BKL held */
 	.release = ext3_release_dir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 388bbdf..bcd9b88 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -67,6 +67,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync = ext3_sync_file,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9dc9316..f69404c 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync = ext4_sync_file,
 	.release = ext4_release_dir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9630583..93a129b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -84,6 +84,9 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault = filemap_fault,
 	.page_mkwrite = ext4_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = filemap_checkpoint,
+#endif
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -146,6 +149,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync = ext4_sync_file,
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (5 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
` (9 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 drivers/char/mem.c | 2 ++
 drivers/char/random.c | 2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 48788db..57e3443 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -763,6 +763,7 @@ static const struct file_operations null_fops = {
 	.read = read_null,
 	.write = write_null,
 	.splice_write = splice_write_null,
+	.checkpoint = generic_file_checkpoint,
 };
 
 #ifdef CONFIG_DEVPORT
@@ -779,6 +780,7 @@ static const struct file_operations zero_fops = {
 	.read = read_zero,
 	.write = write_zero,
 	.mmap = mmap_zero,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2849713..c082789 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1169,6 +1169,7 @@ const struct file_operations random_fops = {
 	.poll = random_poll,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations urandom_fops = {
@@ -1176,6 +1177,7 @@ const struct file_operations urandom_fops = {
 	.write = random_write,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /***************************************************************
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
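For readers skimming the thread, here is a rough userspace sketch of the opt-in scheme these patches rely on: a file type becomes checkpointable by pointing its '->checkpoint' f_op at a generic handler, and a type that leaves the pointer NULL is refused. This is not the series' implementation. All mock_* names are hypothetical, the saved fields (position and flags) are only a guess at the kind of per-file state the cover letter describes, and -EBADF as the refusal code is an assumption.

```c
/* Userspace mock, NOT kernel code: all mock_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_file;

/* Hypothetical image record holding what is saved per open file. */
struct mock_file_hdr {
	long long f_pos;
	unsigned int f_flags;
};

struct mock_file_operations {
	/* NULL means "this file type cannot be checkpointed". */
	int (*checkpoint)(struct mock_file_hdr *out,
			  const struct mock_file *file);
};

struct mock_file {
	const struct mock_file_operations *f_op;
	long long f_pos;
	unsigned int f_flags;
};

/* Generic handler: records only state kept in the file object itself. */
int mock_generic_file_checkpoint(struct mock_file_hdr *out,
				 const struct mock_file *file)
{
	out->f_pos = file->f_pos;
	out->f_flags = file->f_flags;
	return 0;
}

/* A simple device that opts in, in the spirit of null_fops above. */
const struct mock_file_operations mock_null_fops = {
	.checkpoint = mock_generic_file_checkpoint,
};

/* A file type that explicitly opts out. */
const struct mock_file_operations mock_unsupported_fops = {
	.checkpoint = NULL,
};

/* Dispatch through the table; refuse files without a handler. */
int mock_checkpoint_file(struct mock_file_hdr *out,
			 const struct mock_file *file)
{
	assert(out != NULL && file != NULL);
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;	/* assumed error code for the sketch */
	return file->f_op->checkpoint(out, file);
}
```

A generic handler of this kind only makes sense for file types whose state is fully described by the file object, which matches the choice of simple devices such as /dev/null and /dev/zero in the patch above.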
* [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (6 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
` (8 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

These patches extend the use of the generic file checkpoint operation
to non-extX filesystems which have lseek operations that ensure we can
save and restore the files for later use. Note that this does not
include things like FUSE, network filesystems, or pseudo-filesystem
kernel interfaces.

Only compile and boot tested (on x86-32).

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

[PATCH 01/36] Add the checkpoint operation for affs files and directories.
[PATCH 02/36] Add the checkpoint operation for befs directories.
[PATCH 03/36] Add the checkpoint operation for bfs files and directories.
[PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
[PATCH 05/36] Add the checkpoint operation for cramfs directories.
[PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
[PATCH 07/36] Add the checkpoint operation for fat files and directories.
[PATCH 08/36] Add the checkpoint operation for freevxfs directories.
[PATCH 09/36] Add the checkpoint operation for hfs files and directories.
[PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
[PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
[PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
[PATCH 13/36] Add the checkpoint operation for iso directories.
[PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
[PATCH 15/36] Add the checkpoint operation for jfs files and directories.
[PATCH 16/36] Add the checkpoint operation for regular nfs files and directories.
  Skip the various /proc files for now.
[PATCH 17/36] Add the checkpoint operation for ntfs directories.
[PATCH 18/36] Add the checkpoint operation for openpromfs directories.
  Explicitly skip the properties for now.
[PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
[PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
[PATCH 21/36] Add the checkpoint operation for romfs directories.
[PATCH 22/36] Add the checkpoint operation for squashfs directories.
[PATCH 23/36] Add the checkpoint operation for sysv filesystem files and directories.
[PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
[PATCH 25/36] Add the checkpoint operation for udf filesystem files and directories.
[PATCH 26/36] Add the checkpoint operation for xfs files and directories.
[PATCH 27/36] Add checkpoint operation for efs directories.
[PATCH 28/36] Add the checkpoint operation for generic, read-only files.
  At present, some/all files of the following filesystems use this
  generic definition:
[PATCH 29/36] Add checkpoint operation for minix filesystem files and directories.
[PATCH 30/36] Add checkpoint operations for omfs files and directories.
[PATCH 31/36] Add checkpoint operations for ufs files and directories.
[PATCH 32/36] Add checkpoint operations for ramfs files.
  NOTE: since simple_dir_operations are shared between multiple
  filesystems including ramfs, it's not currently possible to
  checkpoint open ramfs directories.
[PATCH 33/36] Add the checkpoint operation for adfs files and directories.
[PATCH 34/36] Add the checkpoint operation to exofs files and directories.
[PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
[PATCH 36/36] Add checkpoint operations for UML host filesystem files and directories.

Changelog[v19-rc3]:
  - [Suka] Enable C/R while executing over NFS

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 fs/adfs/dir.c | 1 +
 fs/adfs/file.c | 1 +
 fs/affs/dir.c | 1 +
 fs/affs/file.c | 1 +
 fs/befs/linuxvfs.c | 1 +
 fs/bfs/dir.c | 1 +
 fs/bfs/file.c | 1 +
 fs/btrfs/file.c | 1 +
 fs/btrfs/inode.c | 1 +
 fs/btrfs/super.c | 1 +
 fs/cramfs/inode.c | 1 +
 fs/ecryptfs/file.c | 2 ++
 fs/ecryptfs/miscdev.c | 1 +
 fs/efs/dir.c | 1 +
 fs/exofs/dir.c | 1 +
 fs/exofs/file.c | 1 +
 fs/fat/dir.c | 1 +
 fs/fat/file.c | 1 +
 fs/freevxfs/vxfs_lookup.c | 1 +
 fs/hfs/dir.c | 1 +
 fs/hfs/inode.c | 1 +
 fs/hfsplus/dir.c | 1 +
 fs/hfsplus/inode.c | 1 +
 fs/hostfs/hostfs_kern.c | 2 ++
 fs/hpfs/dir.c | 1 +
 fs/hpfs/file.c | 1 +
 fs/hppfs/hppfs.c | 2 ++
 fs/isofs/dir.c | 1 +
 fs/jffs2/dir.c | 1 +
 fs/jffs2/file.c | 1 +
 fs/jfs/file.c | 1 +
 fs/jfs/namei.c | 1 +
 fs/minix/dir.c | 1 +
 fs/minix/file.c | 1 +
 fs/nfs/dir.c | 1 +
 fs/nfs/file.c | 4 ++++
 fs/nilfs2/dir.c | 2 +-
 fs/nilfs2/file.c | 1 +
 fs/ntfs/dir.c | 1 +
 fs/ntfs/file.c | 3 ++-
 fs/omfs/dir.c | 1 +
 fs/omfs/file.c | 1 +
 fs/openpromfs/inode.c | 2 ++
 fs/qnx4/dir.c | 1 +
 fs/ramfs/file-mmu.c | 1 +
 fs/ramfs/file-nommu.c | 1 +
 fs/read_write.c | 1 +
 fs/reiserfs/dir.c | 1 +
 fs/reiserfs/file.c | 1 +
 fs/romfs/mmap-nommu.c | 1 +
 fs/romfs/super.c | 1 +
 fs/squashfs/dir.c | 3 ++-
 fs/sysv/dir.c | 1 +
 fs/sysv/file.c | 1 +
 fs/ubifs/debug.c | 1 +
 fs/ubifs/dir.c |
1 + fs/ubifs/file.c | 1 + fs/udf/dir.c | 1 + fs/udf/file.c | 1 + fs/ufs/dir.c | 1 + fs/ufs/file.c | 1 + fs/xfs/linux-2.6/xfs_file.c | 2 ++ 62 files changed, 72 insertions(+), 3 deletions(-) diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c index 23aa52f..7106f32 100644 --- a/fs/adfs/dir.c +++ b/fs/adfs/dir.c @@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = { .llseek = generic_file_llseek, .readdir = adfs_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; static int diff --git a/fs/adfs/file.c b/fs/adfs/file.c index 005ea34..97bd298 100644 --- a/fs/adfs/file.c +++ b/fs/adfs/file.c @@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = { .write = do_sync_write, .aio_write = generic_file_aio_write, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations adfs_file_inode_operations = { diff --git a/fs/affs/dir.c b/fs/affs/dir.c index 8ca8f3a..6cc5e43 100644 --- a/fs/affs/dir.c +++ b/fs/affs/dir.c @@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = { .llseek = generic_file_llseek, .readdir = affs_readdir, .fsync = affs_file_fsync, + .checkpoint = generic_file_checkpoint, }; /* diff --git a/fs/affs/file.c b/fs/affs/file.c index 184e55c..d580a12 100644 --- a/fs/affs/file.c +++ b/fs/affs/file.c @@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = { .release = affs_file_release, .fsync = affs_file_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations affs_file_inode_operations = { diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c index 34ddda8..b97f79b 100644 --- a/fs/befs/linuxvfs.c +++ b/fs/befs/linuxvfs.c @@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = { .read = generic_read_dir, .readdir = befs_readdir, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations 
befs_dir_inode_operations = { diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c index 1e41aad..d78015e 100644 --- a/fs/bfs/dir.c +++ b/fs/bfs/dir.c @@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = { .readdir = bfs_readdir, .fsync = simple_fsync, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; extern void dump_imap(const char *, struct super_block *); diff --git a/fs/bfs/file.c b/fs/bfs/file.c index 88b9a3f..7f61ed6 100644 --- a/fs/bfs/file.c +++ b/fs/bfs/file.c @@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = { .aio_write = generic_file_aio_write, .mmap = generic_file_mmap, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; static int bfs_move_block(unsigned long from, unsigned long to, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 6ed434a..281a2b8 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1164,4 +1164,5 @@ const struct file_operations btrfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = btrfs_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4deb280..606c31d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5971,6 +5971,7 @@ static const struct file_operations btrfs_dir_file_operations = { #endif .release = btrfs_release_file, .fsync = btrfs_sync_file, + .checkpoint = generic_file_checkpoint, }; static struct extent_io_ops btrfs_extent_io_ops = { diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 8a1ea6e..7a28ac5 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -718,6 +718,7 @@ static const struct file_operations btrfs_ctl_fops = { .unlocked_ioctl = btrfs_control_ioctl, .compat_ioctl = btrfs_control_ioctl, .owner = THIS_MODULE, + .checkpoint = generic_file_checkpoint, }; static struct miscdevice btrfs_misc = { diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c index dd3634e..0927503 100644 --- a/fs/cramfs/inode.c +++ b/fs/cramfs/inode.c @@ -532,6 +532,7 @@ static 
const struct file_operations cramfs_directory_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .readdir = cramfs_readdir, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations cramfs_dir_inode_operations = { diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c index 678172b..a8973ef 100644 --- a/fs/ecryptfs/file.c +++ b/fs/ecryptfs/file.c @@ -305,6 +305,7 @@ const struct file_operations ecryptfs_dir_fops = { .fsync = ecryptfs_fsync, .fasync = ecryptfs_fasync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct file_operations ecryptfs_main_fops = { @@ -322,6 +323,7 @@ const struct file_operations ecryptfs_main_fops = { .fsync = ecryptfs_fsync, .fasync = ecryptfs_fasync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; static int diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c index 4ec8f61..9fd9b39 100644 --- a/fs/ecryptfs/miscdev.c +++ b/fs/ecryptfs/miscdev.c @@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = { .read = ecryptfs_miscdev_read, .write = ecryptfs_miscdev_write, .release = ecryptfs_miscdev_release, + .checkpoint = generic_file_checkpoint, }; static struct miscdevice ecryptfs_miscdev = { diff --git a/fs/efs/dir.c b/fs/efs/dir.c index 7ee6f7e..da344b8 100644 --- a/fs/efs/dir.c +++ b/fs/efs/dir.c @@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .readdir = efs_readdir, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations efs_dir_inode_operations = { diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c index 4cfab1c..f6693d3 100644 --- a/fs/exofs/dir.c +++ b/fs/exofs/dir.c @@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .readdir = exofs_readdir, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/exofs/file.c 
b/fs/exofs/file.c index 839b9dc..257e9da 100644 --- a/fs/exofs/file.c +++ b/fs/exofs/file.c @@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id) const struct file_operations exofs_file_operations = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .read = do_sync_read, .write = do_sync_write, .aio_read = generic_file_aio_read, diff --git a/fs/fat/dir.c b/fs/fat/dir.c index 530b4ca..e3fa353 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = { .compat_ioctl = fat_compat_dir_ioctl, #endif .fsync = fat_file_fsync, + .checkpoint = generic_file_checkpoint, }; static int fat_get_short_entry(struct inode *dir, loff_t *pos, diff --git a/fs/fat/file.c b/fs/fat/file.c index e8c159d..e5aecc6 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = { .ioctl = fat_generic_ioctl, .fsync = fat_file_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; static int fat_cont_expand(struct inode *inode, loff_t size) diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c index aee049c..3a09132 100644 --- a/fs/freevxfs/vxfs_lookup.c +++ b/fs/freevxfs/vxfs_lookup.c @@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = { const struct file_operations vxfs_dir_operations = { .readdir = vxfs_readdir, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c index 2b3b861..0eef6c2 100644 --- a/fs/hfs/dir.c +++ b/fs/hfs/dir.c @@ -329,6 +329,7 @@ const struct file_operations hfs_dir_operations = { .readdir = hfs_readdir, .llseek = generic_file_llseek, .release = hfs_dir_release, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations hfs_dir_inode_operations = { diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index a1cbff2..bf8950f 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -607,6 +607,7 @@ static const struct 
file_operations hfs_file_operations = { .fsync = file_fsync, .open = hfs_file_open, .release = hfs_file_release, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations hfs_file_inode_operations = { diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c index 5f40236..41fbf2d 100644 --- a/fs/hfsplus/dir.c +++ b/fs/hfsplus/dir.c @@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = { .ioctl = hfsplus_ioctl, .llseek = generic_file_llseek, .release = hfsplus_dir_release, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 1bcf597..19abd7e 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = { .open = hfsplus_file_open, .release = hfsplus_file_release, .ioctl = hfsplus_ioctl, + .checkpoint = generic_file_checkpoint, }; struct inode *hfsplus_new_inode(struct super_block *sb, int mode) diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c index 032604e..67e2356 100644 --- a/fs/hostfs/hostfs_kern.c +++ b/fs/hostfs/hostfs_kern.c @@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync) static const struct file_operations hostfs_file_fops = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .read = do_sync_read, .splice_read = generic_file_splice_read, .aio_read = generic_file_aio_read, @@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = { static const struct file_operations hostfs_dir_fops = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .readdir = hostfs_readdir, .read = generic_read_dir, }; diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c index 8865c94..dcde10f 100644 --- a/fs/hpfs/dir.c +++ b/fs/hpfs/dir.c @@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops = .readdir = hpfs_readdir, .release = hpfs_dir_release, .fsync = hpfs_file_fsync, + .checkpoint = generic_file_checkpoint, 
}; diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c index 3efabff..f1211f0 100644 --- a/fs/hpfs/file.c +++ b/fs/hpfs/file.c @@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops = .release = hpfs_file_release, .fsync = hpfs_file_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations hpfs_file_iops = diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c index 7239efc..e3c3bd3 100644 --- a/fs/hppfs/hppfs.c +++ b/fs/hppfs/hppfs.c @@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = { .read = hppfs_read, .write = hppfs_write, .open = hppfs_open, + .checkpoint = generic_file_checkpoint, }; struct hppfs_dirent { @@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = { .readdir = hppfs_readdir, .open = hppfs_dir_open, .fsync = hppfs_fsync, + .checkpoint = generic_file_checkpoint, }; static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf) diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c index 8ba5441..848059d 100644 --- a/fs/isofs/dir.c +++ b/fs/isofs/dir.c @@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations = { .read = generic_read_dir, .readdir = isofs_readdir, + .checkpoint = generic_file_checkpoint, }; /* diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index 7aa4417..c7c4dcb 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations = .unlocked_ioctl=jffs2_ioctl, .fsync = jffs2_fsync, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c index b7b74e2..f01038d 100644 --- a/fs/jffs2/file.c +++ b/fs/jffs2/file.c @@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations = .mmap = generic_file_readonly_mmap, .fsync = jffs2_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; /* jffs2_file_inode_operations */ diff --git a/fs/jfs/file.c b/fs/jfs/file.c index 2b70fa7..3bd7114 
100644 --- a/fs/jfs/file.c +++ b/fs/jfs/file.c @@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = jfs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index c79a427..585a7d2 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = { .compat_ioctl = jfs_compat_ioctl, #endif .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; static int jfs_ci_hash(struct dentry *dir, struct qstr *this) diff --git a/fs/minix/dir.c b/fs/minix/dir.c index 6198731..74b6fb4 100644 --- a/fs/minix/dir.c +++ b/fs/minix/dir.c @@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = { .read = generic_read_dir, .readdir = minix_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; static inline void dir_put_page(struct page *page) diff --git a/fs/minix/file.c b/fs/minix/file.c index 3eec3e6..2048d09 100644 --- a/fs/minix/file.c +++ b/fs/minix/file.c @@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = { .mmap = generic_file_mmap, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations minix_file_inode_operations = { diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 3c7f03b..7d9d22a 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = { .open = nfs_opendir, .release = nfs_release, .fsync = nfs_fsync_dir, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations nfs_dir_inode_operations = { diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 63f2071..4437ef9 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = { .splice_write = nfs_file_splice_write, .check_flags = nfs_check_flags, .setlease = nfs_setlease, + .checkpoint = 
generic_file_checkpoint, }; const struct inode_operations nfs_file_inode_operations = { @@ -577,6 +578,9 @@ out_unlock: static const struct vm_operations_struct nfs_file_vm_ops = { .fault = filemap_fault, .page_mkwrite = nfs_vm_page_mkwrite, +#ifdef CONFIG_CHECKPOINT + .checkpoint = filemap_checkpoint, +#endif }; static int nfs_need_sync_write(struct file *filp, struct inode *inode) diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c index 76d803e..18b2171 100644 --- a/fs/nilfs2/dir.c +++ b/fs/nilfs2/dir.c @@ -702,5 +702,5 @@ const struct file_operations nilfs_dir_operations = { .compat_ioctl = nilfs_ioctl, #endif /* CONFIG_COMPAT */ .fsync = nilfs_sync_file, - + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c index 30292df..4d585b5 100644 --- a/fs/nilfs2/file.c +++ b/fs/nilfs2/file.c @@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma) */ const struct file_operations nilfs_file_operations = { .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, .read = do_sync_read, .write = do_sync_write, .aio_read = generic_file_aio_read, diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c index 5a9e344..4fe3759 100644 --- a/fs/ntfs/dir.c +++ b/fs/ntfs/dir.c @@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = { /*.ioctl = ,*/ /* Perform function on the mounted filesystem. */ .open = ntfs_dir_open, /* Open directory. */ + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c index 43179dd..32a43f5 100644 --- a/fs/ntfs/file.c +++ b/fs/ntfs/file.c @@ -2224,7 +2224,7 @@ const struct file_operations ntfs_file_ops = { mounted filesystem. */ .mmap = generic_file_mmap, /* Mmap file. */ .open = ntfs_file_open, /* Open file. */ - .splice_read = generic_file_splice_read /* Zero-copy data send with + .splice_read = generic_file_splice_read, /* Zero-copy data send with the data source being on the ntfs partition. 
We do not need to care about the @@ -2234,6 +2234,7 @@ const struct file_operations ntfs_file_ops = { on the ntfs partition. We do not need to care about the data source. */ + .checkpoint = generic_file_checkpoint, }; const struct inode_operations ntfs_file_inode_ops = { diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c index b42d624..e924e33 100644 --- a/fs/omfs/dir.c +++ b/fs/omfs/dir.c @@ -502,4 +502,5 @@ const struct file_operations omfs_dir_operations = { .read = generic_read_dir, .readdir = omfs_readdir, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/omfs/file.c b/fs/omfs/file.c index 399487c..83e63ef 100644 --- a/fs/omfs/file.c +++ b/fs/omfs/file.c @@ -331,6 +331,7 @@ const struct file_operations omfs_file_operations = { .mmap = generic_file_mmap, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations omfs_file_inops = { diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c index ffcd04f..d1f0677 100644 --- a/fs/openpromfs/inode.c +++ b/fs/openpromfs/inode.c @@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = { .read = seq_read, .llseek = seq_lseek, .release = seq_release, + .checkpoint = NULL, }; static int openpromfs_readdir(struct file *, void *, filldir_t); @@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = { .read = generic_read_dir, .readdir = openpromfs_readdir, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *); diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c index 6f30c3d..fa14c55 100644 --- a/fs/qnx4/dir.c +++ b/fs/qnx4/dir.c @@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations = .read = generic_read_dir, .readdir = qnx4_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations qnx4_dir_inode_operations 
= diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c index 78f613c..4430239 100644 --- a/fs/ramfs/file-mmu.c +++ b/fs/ramfs/file-mmu.c @@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = { .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations ramfs_file_inode_operations = { diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c index 1739a4a..9cd6208 100644 --- a/fs/ramfs/file-nommu.c +++ b/fs/ramfs/file-nommu.c @@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = { .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations ramfs_file_inode_operations = { diff --git a/fs/read_write.c b/fs/read_write.c index e258301..65371e1 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = { .aio_read = generic_file_aio_read, .mmap = generic_file_readonly_mmap, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; EXPORT_SYMBOL(generic_ro_fops); diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c index c094f58..8681419 100644 --- a/fs/reiserfs/dir.c +++ b/fs/reiserfs/dir.c @@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = reiserfs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry, diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c index da2dba0..b6008f3 100644 --- a/fs/reiserfs/file.c +++ b/fs/reiserfs/file.c @@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = { .splice_read = generic_file_splice_read, .splice_write = generic_file_splice_write, .llseek = generic_file_llseek, + .checkpoint = 
generic_file_checkpoint, }; const struct inode_operations reiserfs_file_inode_operations = { diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c index f0511e8..03c24d9 100644 --- a/fs/romfs/mmap-nommu.c +++ b/fs/romfs/mmap-nommu.c @@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = { .splice_read = generic_file_splice_read, .mmap = romfs_mmap, .get_unmapped_area = romfs_get_unmapped_area, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/romfs/super.c b/fs/romfs/super.c index 42d2135..476ea8e 100644 --- a/fs/romfs/super.c +++ b/fs/romfs/super.c @@ -282,6 +282,7 @@ error: static const struct file_operations romfs_dir_operations = { .read = generic_read_dir, .readdir = romfs_readdir, + .checkpoint = generic_file_checkpoint, }; static const struct inode_operations romfs_dir_inode_operations = { diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c index 566b0ea..b0c5336 100644 --- a/fs/squashfs/dir.c +++ b/fs/squashfs/dir.c @@ -231,5 +231,6 @@ failed_read: const struct file_operations squashfs_dir_ops = { .read = generic_read_dir, - .readdir = squashfs_readdir + .readdir = squashfs_readdir, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c index 4e50286..53acd29 100644 --- a/fs/sysv/dir.c +++ b/fs/sysv/dir.c @@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = { .read = generic_read_dir, .readdir = sysv_readdir, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; static inline void dir_put_page(struct page *page) diff --git a/fs/sysv/file.c b/fs/sysv/file.c index 96340c0..aee556d 100644 --- a/fs/sysv/file.c +++ b/fs/sysv/file.c @@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = { .mmap = generic_file_mmap, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations sysv_file_inode_operations = { diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c index 9049232..e4f23c6 100644 --- 
a/fs/ubifs/debug.c +++ b/fs/ubifs/debug.c @@ -2623,6 +2623,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf, static const struct file_operations dfs_fops = { .open = open_debugfs_file, .write = write_debugfs_file, + .checkpoint = generic_file_checkpoint, .owner = THIS_MODULE, }; diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c index 552fb01..89ab2aa 100644 --- a/fs/ubifs/dir.c +++ b/fs/ubifs/dir.c @@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ubifs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index 16a6444..254a4d9 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1582,4 +1582,5 @@ const struct file_operations ubifs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ubifs_compat_ioctl, #endif + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/udf/dir.c b/fs/udf/dir.c index 61d9a76..6586dbe 100644 --- a/fs/udf/dir.c +++ b/fs/udf/dir.c @@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = { .readdir = udf_readdir, .ioctl = udf_ioctl, .fsync = simple_fsync, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/udf/file.c b/fs/udf/file.c index f311d50..e671552 100644 --- a/fs/udf/file.c +++ b/fs/udf/file.c @@ -215,6 +215,7 @@ const struct file_operations udf_file_operations = { .fsync = simple_fsync, .splice_read = generic_file_splice_read, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; const struct inode_operations udf_file_inode_operations = { diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c index 22af68f..29c9396 100644 --- a/fs/ufs/dir.c +++ b/fs/ufs/dir.c @@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = { .readdir = ufs_readdir, .fsync = simple_fsync, .llseek = generic_file_llseek, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/ufs/file.c b/fs/ufs/file.c index 73655c6..15c8616 100644 --- a/fs/ufs/file.c +++ 
b/fs/ufs/file.c @@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = { .open = generic_file_open, .fsync = simple_fsync, .splice_read = generic_file_splice_read, + .checkpoint = generic_file_checkpoint, }; diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index e4caeb2..926f377 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -259,6 +259,7 @@ const struct file_operations xfs_file_operations = { #ifdef HAVE_FOP_OPEN_EXEC .open_exec = xfs_file_open_exec, #endif + .checkpoint = generic_file_checkpoint, }; const struct file_operations xfs_dir_file_operations = { @@ -271,6 +272,7 @@ const struct file_operations xfs_dir_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .fsync = xfs_file_fsync, + .checkpoint = generic_file_checkpoint, }; static const struct vm_operations_struct xfs_file_vm_ops = { -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (7 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
` (7 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger
During c/r of pipes we need to save and restore pipe buffers. But
do_splice() requires two file descriptors, so we can't use it: we
always have one file descriptor (the checkpoint image) and one
pipe_inode_info.
This patch exports interfaces that work at the pipe_inode_info level,
namely link_pipe(), do_splice_to() and do_splice_from(). They are used
in the following patch to save and restore pipe buffers without
unnecessary data copy.
It slightly modifies both do_splice_to() and do_splice_from() to detect
the case of pipe-to-pipe transfer, in which case they invoke
splice_pipe_to_pipe() directly.
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E.
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- fs/splice.c | 61 ++++++++++++++++++++++++++++++++--------------- include/linux/splice.h | 9 +++++++ 2 files changed, 50 insertions(+), 20 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 3920866..76acb55 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1051,18 +1051,43 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out, EXPORT_SYMBOL(generic_splice_sendpage); /* + * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same + * location, so checking ->i_pipe is not enough to verify that this is a + * pipe. + */ +static inline struct pipe_inode_info *pipe_info(struct inode *inode) +{ + if (S_ISFIFO(inode->i_mode)) + return inode->i_pipe; + + return NULL; +} + +static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + size_t len, unsigned int flags); + +/* * Attempt to initiate a splice from pipe to file. */ -static long do_splice_from(struct pipe_inode_info *pipe, struct file *out, - loff_t *ppos, size_t len, unsigned int flags) +long do_splice_from(struct pipe_inode_info *pipe, struct file *out, + loff_t *ppos, size_t len, unsigned int flags) { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); + struct pipe_inode_info *opipe; int ret; if (unlikely(!(out->f_mode & FMODE_WRITE))) return -EBADF; + /* When called directly (e.g. from c/r) output may be a pipe */ + opipe = pipe_info(out->f_path.dentry->d_inode); + if (opipe) { + BUG_ON(opipe == pipe); + return splice_pipe_to_pipe(pipe, opipe, len, flags); + } + if (unlikely(out->f_flags & O_APPEND)) return -EINVAL; @@ -1081,17 +1106,25 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out, /* * Attempt to initiate a splice from a file to a pipe. 
*/ -static long do_splice_to(struct file *in, loff_t *ppos, - struct pipe_inode_info *pipe, size_t len, - unsigned int flags) +long do_splice_to(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) { ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); + struct pipe_inode_info *ipipe; int ret; if (unlikely(!(in->f_mode & FMODE_READ))) return -EBADF; + /* When called directly (e.g. from c/r) input may be a pipe */ + ipipe = pipe_info(in->f_path.dentry->d_inode); + if (ipipe) { + BUG_ON(ipipe == pipe); + return splice_pipe_to_pipe(ipipe, pipe, len, flags); + } + ret = rw_verify_area(READ, in, ppos, len); if (unlikely(ret < 0)) return ret; @@ -1271,18 +1304,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, struct pipe_inode_info *opipe, size_t len, unsigned int flags); -/* - * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same - * location, so checking ->i_pipe is not enough to verify that this is a - * pipe. - */ -static inline struct pipe_inode_info *pipe_info(struct inode *inode) -{ - if (S_ISFIFO(inode->i_mode)) - return inode->i_pipe; - - return NULL; -} /* * Determine where to splice to/from. @@ -1887,9 +1908,9 @@ retry: /* * Link contents of ipipe to opipe.
*/ -static int link_pipe(struct pipe_inode_info *ipipe, - struct pipe_inode_info *opipe, - size_t len, unsigned int flags) +int link_pipe(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + size_t len, unsigned int flags) { struct pipe_buffer *ibuf, *obuf; int ret = 0, i = 0, nbuf; diff --git a/include/linux/splice.h b/include/linux/splice.h index 18e7c7c..431662c 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -82,4 +82,13 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *, extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *, splice_direct_actor *); +extern int link_pipe(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + size_t len, unsigned int flags); +extern long do_splice_to(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags); +extern long do_splice_from(struct pipe_inode_info *pipe, struct file *out, + loff_t *ppos, size_t len, unsigned int flags); + #endif -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
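The pipe_info() helper in the patch above decides whether a file is a pipe by testing S_ISFIFO() on the inode mode instead of trusting ->i_pipe alone, and the modified do_splice_to()/do_splice_from() use that answer to route pipe-to-pipe transfers. A rough userspace analogue of that decision, assuming a Linux/POSIX system, might look like the sketch below. The names fd_is_pipe(), classify_transfer() and enum transfer_kind are hypothetical and exist only for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Which kind of transfer a (source, destination) pair would need. */
enum transfer_kind {
	XFER_INVALID,		/* fstat() failed on one of the descriptors */
	XFER_FILE_TO_FILE,	/* neither end is a pipe */
	XFER_FILE_TO_PIPE,
	XFER_PIPE_TO_FILE,
	XFER_PIPE_TO_PIPE,	/* the case the patch hands to splice_pipe_to_pipe() */
};

/* Return 1 if fd refers to a pipe or FIFO, 0 if it does not, -1 on error. */
int fd_is_pipe(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return S_ISFIFO(st.st_mode) ? 1 : 0;
}

/* Classify a transfer by checking the file type of both ends. */
enum transfer_kind classify_transfer(int in_fd, int out_fd)
{
	int in_pipe, out_pipe;

	/* Loosely mirrors the BUG_ON(ipipe == pipe) sanity check in the patch. */
	assert(in_fd != out_fd);

	in_pipe = fd_is_pipe(in_fd);
	out_pipe = fd_is_pipe(out_fd);
	if (in_pipe < 0 || out_pipe < 0)
		return XFER_INVALID;
	if (in_pipe && out_pipe)
		return XFER_PIPE_TO_PIPE;
	if (in_pipe)
		return XFER_PIPE_TO_FILE;
	if (out_pipe)
		return XFER_FILE_TO_PIPE;
	return XFER_FILE_TO_FILE;
}
```

Checking the file type first matters for the same reason given in the pipe_info() comment: the field that holds the pipe pointer is shared with other inode kinds, so a non-NULL value by itself proves nothing.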
* [C/R v20][PATCH 51/96] c/r: support for open pipes
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (8 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
` (6 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger
A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read end or the write end.
To checkpoint a file descriptor that refers to a pipe (either end),
we first look up the inode in the hash table:
If not found, it is the first encounter of this pipe. Besides the
file descriptor, we also (a) save the pipe data, and (b) register the
pipe inode in the hash.
If found, it is the second encounter of this pipe, namely when we hit
the other end of the same pipe.
In both cases we write the pipe-objref of the inode.
To restore, create a new pipe and thus have two file pointers (read
and write ends). We only use one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in
the hash table, with the pipe_objref given for this pipe from the
checkpoint, to be used later when the other end arrives. At this point
we also restore the contents of the pipe buffers.
To save the pipe buffer, given a source pipe, use link_pipe() (the
core of do_tee()) to clone its contents into a temporary
'struct pipe_inode_info', and then use do_splice_from() to transfer it
directly to the checkpoint image file.
To restore the pipe buffer, with a fresh newly allocated target pipe, use do_splice_to() to splice the data directly between the checkpoint image file and the pipe. Changelog[v19-rc1]: - Switch to ckpt_obj_try_fetch() - [Matt Helsley] Add cpp definitions for enums Changelog[v18]: - Adjust format of pipe buffer to include the mandatory pre-header Changelog[v17]: - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- checkpoint/files.c | 7 ++ fs/pipe.c | 157 ++++++++++++++++++++++++++++++++++++++++ include/linux/checkpoint_hdr.h | 9 +++ include/linux/pipe_fs_i.h | 8 ++ 4 files changed, 181 insertions(+), 0 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index b404c8f..1c294fe 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -17,6 +17,7 @@ #include <linux/file.h> #include <linux/fdtable.h> #include <linux/fsnotify.h> +#include <linux/pipe_fs_i.h> #include <linux/syscalls.h> #include <linux/deferqueue.h> #include <linux/checkpoint.h> @@ -592,6 +593,12 @@ static struct restore_file_ops restore_file_ops[] = { .file_type = CKPT_FILE_GENERIC, .restore = generic_file_restore, }, + /* pipes */ + { + .file_name = "PIPE", + .file_type = CKPT_FILE_PIPE, + .restore = pipe_file_restore, + }, }; static struct file *do_restore_file(struct ckpt_ctx *ctx) diff --git a/fs/pipe.c b/fs/pipe.c index 37ba29f..747b2d7 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -13,11 +13,13 @@ #include <linux/fs.h> #include <linux/mount.h> #include <linux/pipe_fs_i.h> +#include <linux/splice.h> #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/audit.h> #include <linux/syscalls.h> +#include <linux/checkpoint.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -828,6 +830,158 @@ pipe_rdwr_open(struct 
inode *inode, struct file *filp) return ret; } +#ifdef CONFIG_CHECKPOINT +static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode) +{ + struct pipe_inode_info *pipe; + int len, ret = -ENOMEM; + + pipe = alloc_pipe_info(NULL); + if (!pipe) + return ret; + + pipe->readers = 1; /* bluff link_pipe() below */ + len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK); + if (len == -EAGAIN) + len = 0; + if (len < 0) { + ret = len; + goto out; + } + + ret = ckpt_write_obj_type(ctx, NULL, len, CKPT_HDR_PIPE_BUF); + if (ret < 0) + goto out; + + ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0); + if (ret < 0) + goto out; + if (ret != len) + ret = -EPIPE; /* can occur due to an error in target file */ + out: + __free_pipe_info(pipe); + return ret; +} + +static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) +{ + struct ckpt_hdr_file_pipe *h; + struct inode *inode = file->f_dentry->d_inode; + int objref, first, ret; + + objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first); + if (objref < 0) + return objref; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE); + if (!h) + return -ENOMEM; + + h->common.f_type = CKPT_FILE_PIPE; + h->pipe_objref = objref; + + ret = checkpoint_file_common(ctx, file, &h->common); + if (ret < 0) + goto out; + ret = ckpt_write_obj(ctx, &h->common.h); + if (ret < 0) + goto out; + + if (first) + ret = checkpoint_pipe(ctx, inode); + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +static int restore_pipe(struct ckpt_ctx *ctx, struct file *file) +{ + struct pipe_inode_info *pipe; + int len, ret; + + len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_PIPE_BUF); + if (len < 0) + return len; + + pipe = file->f_dentry->d_inode->i_pipe; + ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0); + + if (ret >= 0 && ret != len) + ret = -EPIPE; /* can occur due to an error in source file */ + + return ret; +} + +struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file 
*ptr) +{ + struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr; + struct file *file; + int fds[2], which, ret; + + if (ptr->h.type != CKPT_HDR_FILE || + ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE) + return ERR_PTR(-EINVAL); + + if (h->pipe_objref <= 0) + return ERR_PTR(-EINVAL); + + file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE); + /* + * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), then this is + * the first time we see this pipe, so we need to restore the + * contents. Otherwise, use the file pointer and skip forward. + */ + if (!IS_ERR(file)) { + get_file(file); + } else if (PTR_ERR(file) == -EINVAL) { + /* first encounter of this pipe: create it */ + ret = do_pipe_flags(fds, 0); + if (ret < 0) + return ERR_PTR(ret); + + which = (ptr->f_flags & O_WRONLY ? 1 : 0); + /* + * Below we return the file corresponding to one side + * of the pipe for our caller to use. Now insert the + * other side of the pipe to the hash, to be picked up + * when that side is restored. + */ + file = fget(fds[1-which]); /* the 'other' side */ + if (!file) /* this should _never_ happen ! */ + return ERR_PTR(-EBADF); + ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE); + if (ret < 0) + goto out; + + ret = restore_pipe(ctx, file); + fput(file); + if (ret < 0) + return ERR_PTR(ret); + + file = fget(fds[which]); /* 'this' side */ + if (!file) /* this should _never_ happen ! */ + return ERR_PTR(-EBADF); + + /* get rid of the file descriptors (caller sets that) */ + sys_close(fds[which]); + sys_close(fds[1-which]); + } else { + return file; + } + + ret = restore_file_common(ctx, file, ptr); + out: + if (ret < 0) { + fput(file); + file = ERR_PTR(ret); + } + + return file; +} +#else +#define pipe_file_checkpoint NULL +#endif /* CONFIG_CHECKPOINT */ + /* * The file_operations structs are not static because they * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -844,6 +998,7 @@ const struct file_operations read_pipefifo_fops = { .open = pipe_read_open, .release = pipe_read_release, .fasync = pipe_read_fasync, + .checkpoint = pipe_file_checkpoint, }; const struct file_operations write_pipefifo_fops = { @@ -856,6 +1011,7 @@ const struct file_operations write_pipefifo_fops = { .open = pipe_write_open, .release = pipe_write_release, .fasync = pipe_write_fasync, + .checkpoint = pipe_file_checkpoint, }; const struct file_operations rdwr_pipefifo_fops = { @@ -869,6 +1025,7 @@ const struct file_operations rdwr_pipefifo_fops = { .open = pipe_rdwr_open, .release = pipe_rdwr_release, .fasync = pipe_rdwr_fasync, + .checkpoint = pipe_file_checkpoint, }; struct pipe_inode_info * alloc_pipe_info(struct inode *inode) diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 6fae6ef..885d06b 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -90,6 +90,8 @@ enum { #define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME CKPT_HDR_FILE, #define CKPT_HDR_FILE CKPT_HDR_FILE + CKPT_HDR_PIPE_BUF, +#define CKPT_HDR_PIPE_BUF CKPT_HDR_PIPE_BUF CKPT_HDR_MM = 401, #define CKPT_HDR_MM CKPT_HDR_MM @@ -277,6 +279,8 @@ enum file_type { #define CKPT_FILE_IGNORE CKPT_FILE_IGNORE CKPT_FILE_GENERIC, #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC + CKPT_FILE_PIPE, +#define CKPT_FILE_PIPE CKPT_FILE_PIPE CKPT_FILE_MAX #define CKPT_FILE_MAX CKPT_FILE_MAX }; @@ -296,6 +300,11 @@ struct ckpt_hdr_file_generic { struct ckpt_hdr_file common; } __attribute__((aligned(8))); +struct ckpt_hdr_file_pipe { + struct ckpt_hdr_file common; + __s32 pipe_objref; +} __attribute__((aligned(8))); + /* memory layout */ struct ckpt_hdr_mm { struct ckpt_hdr h; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index b43a9e0..e526a12 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *); int 
generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *); void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *); +/* checkpoint/restart */ +#ifdef CONFIG_CHECKPOINT +struct ckpt_ctx; +struct ckpt_hdr_file; +extern struct file *pipe_file_restore(struct ckpt_ctx *ctx, + struct ckpt_hdr_file *ptr); +#endif + #endif -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
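The save and restore steps described in this patch (clone the pipe contents into a temporary pipe, then splice them into the image file; later splice them from the image into a fresh pipe) have a close userspace counterpart in the tee(2) and splice(2) system calls. The sketch below is only an illustration under that assumption, not the kernel code path: snapshot_pipe() and refill_pipe() are hypothetical helper names, and the snapshot is limited to what fits in one default-sized scratch pipe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/*
 * Save the bytes currently buffered in the pipe behind pipe_rd into out_fd
 * without consuming them. Returns the number of bytes saved, or -1 on error.
 * Limitation of this sketch: at most one scratch pipe's worth of data is
 * captured per call.
 */
ssize_t snapshot_pipe(int pipe_rd, int out_fd)
{
	int scratch[2];
	ssize_t len, done = 0;

	assert(pipe_rd != out_fd);
	if (pipe(scratch) < 0)
		return -1;

	/* Duplicate the buffered data into the scratch pipe; the source keeps it. */
	len = tee(pipe_rd, scratch[1], INT_MAX, SPLICE_F_NONBLOCK);
	if (len < 0 && errno == EAGAIN)
		len = 0;	/* nothing buffered right now */

	/* Move the duplicated data from the scratch pipe into the image file. */
	while (len > 0 && done < len) {
		ssize_t n = splice(scratch[0], NULL, out_fd, NULL,
				   (size_t)(len - done), 0);
		if (n <= 0) {
			len = -1;	/* short transfer: error on the target */
			break;
		}
		done += n;
	}

	close(scratch[0]);
	close(scratch[1]);
	return len;
}

/* Refill a fresh pipe with len bytes taken from in_fd at its current offset. */
int refill_pipe(int in_fd, int pipe_wr, size_t len)
{
	while (len > 0) {
		ssize_t n = splice(in_fd, NULL, pipe_wr, NULL, len, 0);

		if (n <= 0)
			return -1;	/* error or premature end of the image */
		len -= (size_t)n;
	}
	return 0;
}
```

As in the patch, the source pipe is left untouched by the snapshot, so the checkpointed task can keep reading from it afterwards. A caller would typically record the returned length next to the data, much like the CKPT_HDR_PIPE_BUF header carries the length in the patch.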
* [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (9 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
` (5 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger
FIFOs are almost like pipes. Checkpoint additionally saves the FIFO
pathname. The first time the FIFO is found, it also assigns an @objref
and dumps the contents of the buffers.
To restore, use the @objref only to determine whether a particular
FIFO has already been restored earlier. Note that it ignores the file
pointer that matches that @objref (unlike with pipes, where that file
corresponds to the other end of the pipe). Instead, it creates a new
FIFO using the saved pathname.
Changelog [v19-rc3]:
- Rebase to kernel 2.6.33
Changelog [v19-rc1]:
- Switch to ckpt_obj_try_fetch()
- [Matt Helsley] Add cpp definitions for enums
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E.
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- checkpoint/files.c | 6 +++ fs/pipe.c | 81 +++++++++++++++++++++++++++++++++++++++- include/linux/checkpoint_hdr.h | 2 + include/linux/pipe_fs_i.h | 2 + 4 files changed, 90 insertions(+), 1 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index 1c294fe..c647bfd 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -599,6 +599,12 @@ static struct restore_file_ops restore_file_ops[] = { .file_type = CKPT_FILE_PIPE, .restore = pipe_file_restore, }, + /* fifo */ + { + .file_name = "FIFO", + .file_type = CKPT_FILE_FIFO, + .restore = fifo_file_restore, + }, }; static struct file *do_restore_file(struct ckpt_ctx *ctx) diff --git a/fs/pipe.c b/fs/pipe.c index 747b2d7..8c79493 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -830,6 +830,8 @@ pipe_rdwr_open(struct inode *inode, struct file *filp) return ret; } +static struct vfsmount *pipe_mnt __read_mostly; + #ifdef CONFIG_CHECKPOINT static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode) { @@ -877,7 +879,11 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) if (!h) return -ENOMEM; - h->common.f_type = CKPT_FILE_PIPE; + /* fifo and pipe are similar at checkpoint, differ on restore */ + if (inode->i_sb == pipe_mnt->mnt_sb) + h->common.f_type = CKPT_FILE_PIPE; + else + h->common.f_type = CKPT_FILE_FIFO; h->pipe_objref = objref; ret = checkpoint_file_common(ctx, file, &h->common); @@ -887,6 +893,13 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) if (ret < 0) goto out; + /* FIFO also needs a file name */ + if (h->common.f_type == CKPT_FILE_FIFO) { + ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path); + if (ret < 0) + goto out; + } + if (first) ret = checkpoint_pipe(ctx, inode); out: @@ -978,8 +991,74 @@ struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr) return file; } + +struct file *fifo_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr) 
+{ + struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr; + struct file *file; + int first, ret; + + if (ptr->h.type != CKPT_HDR_FILE || + ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_FIFO) + return ERR_PTR(-EINVAL); + + if (h->pipe_objref <= 0) + return ERR_PTR(-EINVAL); + + /* + * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), this is the + * first time for this fifo. + */ + file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE); + if (!IS_ERR(file)) + first = 0; + else if (PTR_ERR(file) == -EINVAL) + first = 1; + else + return file; + + /* + * To avoid blocking, always open the fifo with O_RDWR; + * then fix flags below. + */ + file = restore_open_fname(ctx, (ptr->f_flags & ~O_ACCMODE) | O_RDWR); + if (IS_ERR(file)) + return file; + + if ((ptr->f_flags & O_ACCMODE) == O_RDONLY) { + file->f_flags = (file->f_flags & ~O_ACCMODE) | O_RDONLY; + file->f_mode &= ~FMODE_WRITE; + } else if ((ptr->f_flags & O_ACCMODE) == O_WRONLY) { + file->f_flags = (file->f_flags & ~O_ACCMODE) | O_WRONLY; + file->f_mode &= ~FMODE_READ; + } else if ((ptr->f_flags & O_ACCMODE) != O_RDWR) { + ret = -EINVAL; + goto out; + } + + /* first time: add to objhash and restore fifo's contents */ + if (first) { + ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE); + if (ret < 0) + goto out; + + ret = restore_pipe(ctx, file); + if (ret < 0) + goto out; + } + + ret = restore_file_common(ctx, file, ptr); + out: + if (ret < 0) { + fput(file); + file = ERR_PTR(ret); + } + + return file; +} #else #define pipe_file_checkpoint NULL +#define fifo_file_checkpoint NULL #endif /* CONFIG_CHECKPOINT */ /* diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 885d06b..fce35f3 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -281,6 +281,8 @@ enum file_type { #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC CKPT_FILE_PIPE, #define CKPT_FILE_PIPE CKPT_FILE_PIPE + CKPT_FILE_FIFO, +#define CKPT_FILE_FIFO 
CKPT_FILE_FIFO CKPT_FILE_MAX #define CKPT_FILE_MAX CKPT_FILE_MAX }; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index e526a12..596403e 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -160,6 +160,8 @@ struct ckpt_ctx; struct ckpt_hdr_file; extern struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr); +extern struct file *fifo_file_restore(struct ckpt_ctx *ctx, + struct ckpt_hdr_file *ptr); #endif #endif -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
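fifo_file_restore() in the patch above avoids blocking by always opening the FIFO with O_RDWR and then narrowing f_flags/f_mode inside the kernel. Userspace cannot rewrite an open file's access mode, so the sketch below uses a different technique to get a similar effect: it holds a temporary O_RDWR descriptor while opening the FIFO with the wanted mode. It relies on Linux behavior (an O_RDWR open of a FIFO does not wait for a peer, which POSIX leaves undefined), and open_fifo_noblock() is a hypothetical name used only here.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open the FIFO at path with the requested flags without waiting for a peer.
 * Returns the new descriptor, or -1 on error.
 */
int open_fifo_noblock(const char *path, int flags)
{
	int keeper, fd;

	assert(path != NULL);

	/* On Linux, opening a FIFO with O_RDWR succeeds without a peer. */
	keeper = open(path, O_RDWR);
	if (keeper < 0)
		return -1;

	/* The keeper counts as both reader and writer, so this open cannot block. */
	fd = open(path, flags);

	/* Any buffered data survives as long as fd itself stays open. */
	close(keeper);
	return fd;
}
```

A descriptor obtained this way for O_WRONLY has no reader once the keeper is closed, so writes to it fail with EPIPE (and raise SIGPIPE) until some other process opens the read side.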
* [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (10 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
` (4 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger
From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
We do not support restarting fsnotify watches. inotify and fanotify
utilize anon_inodes for pseudofiles which lack the .checkpoint
operation. So they already cleanly prevent checkpoint. dnotify on the
other hand registers its watches using fcntl() which does not require
the userspace task to hold an fd with an empty .checkpoint operation.
This means userspace could use dnotify to set up fsnotify watches
which won't be re-created during restart.
Check for fsnotify watches created with dnotify and reject checkpoint
if there are any.
Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E.
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- checkpoint/files.c | 5 +++++ fs/notify/dnotify/dnotify.c | 18 ++++++++++++++++++ include/linux/dnotify.h | 6 ++++++ 3 files changed, 29 insertions(+), 0 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index c647bfd..62feadd 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -207,6 +207,11 @@ int checkpoint_file(struct ckpt_ctx *ctx, void *ptr) return -EBADF; } + if (is_dnotify_attached(file)) { + ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file); + return -EBADF; + } + ret = file->f_op->checkpoint(ctx, file); if (ret < 0) ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file); diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c index 7e54e52..0a63bf6 100644 --- a/fs/notify/dnotify/dnotify.c +++ b/fs/notify/dnotify/dnotify.c @@ -289,6 +289,24 @@ static int attach_dn(struct dnotify_struct *dn, struct dnotify_mark_entry *dnent return 0; } +int is_dnotify_attached(struct file *filp) +{ + struct fsnotify_mark_entry *entry; + struct inode *inode; + + inode = filp->f_path.dentry->d_inode; + if (!S_ISDIR(inode->i_mode)) + return 0; + + spin_lock(&inode->i_lock); + entry = fsnotify_find_mark_entry(dnotify_group, inode); + spin_unlock(&inode->i_lock); + if (!entry) + return 0; + fsnotify_put_mark(entry); + return 1; +} + /* * When a process calls fcntl to attach a dnotify watch to a directory it ends * up here. 
Allocate both a mark for fsnotify to add and a dnotify_struct to be diff --git a/include/linux/dnotify.h b/include/linux/dnotify.h index ecc0628..b9ce13c 100644 --- a/include/linux/dnotify.h +++ b/include/linux/dnotify.h @@ -29,6 +29,7 @@ struct dnotify_struct { FS_MOVED_FROM | FS_MOVED_TO) extern void dnotify_flush(struct file *, fl_owner_t); +extern int is_dnotify_attached(struct file *); extern int fcntl_dirnotify(int, struct file *, unsigned long); #else @@ -37,6 +38,11 @@ static inline void dnotify_flush(struct file *filp, fl_owner_t id) { } +static inline int is_dnotify_attached(struct file *filp) +{ + return 0; +} + static inline int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg) { return -EINVAL; -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
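The reasoning in this patch comes down to two independent reasons for refusing a checkpoint: the file type provides no .checkpoint operation at all (the inotify/fanotify case), or the file is a directory with a dnotify watch registered through fcntl(fd, F_NOTIFY, ...). The toy model below illustrates that decision in plain userspace C. Every name in it (struct toy_file, toy_checkpoint_file(), and so on) is invented for this sketch and only loosely follows checkpoint_file() from the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_file;

/* Per-type operations; a NULL checkpoint hook means "not supported". */
struct toy_file_ops {
	int (*checkpoint)(struct toy_file *file);
};

struct toy_file {
	const struct toy_file_ops *ops;
	int is_dir;		/* dnotify only applies to directories */
	int dnotify_watches;	/* watches registered on the underlying inode */
	int saved;		/* set once the file has been checkpointed */
};

/* Stand-in for is_dnotify_attached(): only directories can carry watches. */
int toy_dnotify_attached(const struct toy_file *file)
{
	return file->is_dir && file->dnotify_watches > 0;
}

/* Stand-in for a generic checkpoint hook: just mark the file as saved. */
int toy_generic_checkpoint(struct toy_file *file)
{
	file->saved = 1;
	return 0;
}

const struct toy_file_ops toy_generic_ops = { .checkpoint = toy_generic_checkpoint };
const struct toy_file_ops toy_unsupported_ops = { .checkpoint = NULL };

/* Refuse files that cannot be restored faithfully, otherwise call the hook. */
int toy_checkpoint_file(struct toy_file *file)
{
	assert(file != NULL && file->ops != NULL);

	if (!file->ops->checkpoint)
		return -EBADF;	/* file type never opted in */
	if (toy_dnotify_attached(file))
		return -EBADF;	/* watch would be lost on restart */
	return file->ops->checkpoint(file);
}
```

The second check is needed precisely because the first one cannot catch dnotify: the watch hangs off an ordinary directory whose file type does support checkpoint.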
* [C/R v20][PATCH 66/96] c/r: restore file->f_cred
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (11 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
` (3 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

From: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Restore a file's f_cred. This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.

Changelog[v1]:
  - [Nathan Lynch] discard const from struct cred * where appropriate

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/files.c             |   18 ++++++++++++++++--
 include/linux/checkpoint_hdr.h |    2 +-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 62feadd..63a611f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -148,15 +148,21 @@ static int scan_fds(struct files_struct *files, int **fdtable)
 int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 			   struct ckpt_hdr_file *h)
 {
+	struct cred *f_cred = (struct cred *) file->f_cred;
+
 	h->f_flags = file->f_flags;
 	h->f_mode = file->f_mode;
 	h->f_pos = file->f_pos;
 	h->f_version = file->f_version;
 
+	h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
+	if (h->f_credref < 0)
+		return h->f_credref;
+
 	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
 		h->f_credref);
 
-	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+	/* FIX: need also file->f_owner, etc */
 
 	return 0;
 }
@@ -522,8 +528,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 	fmode_t new_mode = file->f_mode;
 	fmode_t saved_mode = (__force fmode_t) h->f_mode;
 	int ret;
+	struct cred *cred;
+
+	/* FIX: need to restore owner etc */
 
-	/* FIX: need to restore uid, gid, owner etc */
+	/* restore the cred */
+	cred = ckpt_obj_fetch(ctx, h->f_credref, CKPT_OBJ_CRED);
+	if (IS_ERR(cred))
+		return PTR_ERR(cred);
+	put_cred(file->f_cred);
+	file->f_cred = get_cred(cred);
 
 	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
 	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cbccc81..729be96 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -432,7 +432,7 @@ struct ckpt_hdr_file {
 	__u32 f_type;
 	__u32 f_mode;
 	__u32 f_flags;
-	__u32 _padding;
+	__s32 f_credref;
 	__u64 f_pos;
 	__u64 f_version;
 } __attribute__((aligned(8)));
-- 
1.6.3.3
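
For readers unfamiliar with the objref scheme this patch relies on, here is a
minimal userspace sketch of the idea: a shared object is registered once and
gets a small integer reference, and restore looks the object up by that
reference and swaps it in while adjusting reference counts. Everything in it
(cred_model, objhash_model, the model_* helpers, MAX_OBJS) is hypothetical and
only illustrates the pattern; it is not the checkpoint_obj()/ckpt_obj_fetch()
implementation from the series.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16	/* arbitrary table size for the sketch */

/* Stand-in for struct cred: only the reference count matters here. */
struct cred_model { int usage; };

/* Stand-in for the checkpoint object hash: objref N lives in slot N-1. */
struct objhash_model {
	struct cred_model *objs[MAX_OBJS];
	int count;
};

/* Stand-in for struct file, reduced to its credential pointer. */
struct file_model { struct cred_model *f_cred; };

/* Return the objref already assigned to obj, or register it; -1 if full. */
static int model_checkpoint_obj(struct objhash_model *h, struct cred_model *obj)
{
	assert(h && obj);
	for (int i = 0; i < h->count; i++)
		if (h->objs[i] == obj)
			return i + 1;
	if (h->count == MAX_OBJS)
		return -1;
	h->objs[h->count++] = obj;
	return h->count;
}

/* Look up an object by objref; NULL if the reference is out of range. */
static struct cred_model *model_obj_fetch(struct objhash_model *h, int objref)
{
	if (objref < 1 || objref > h->count)
		return NULL;
	return h->objs[objref - 1];
}

/* Swap in the fetched cred: take the new reference, then drop the old one. */
static int model_restore_cred(struct objhash_model *h, struct file_model *f,
			      int credref)
{
	struct cred_model *cred = model_obj_fetch(h, credref);

	if (!cred)
		return -1;
	cred->usage++;		/* plays the role of get_cred() */
	f->f_cred->usage--;	/* plays the role of put_cred() */
	f->f_cred = cred;
	return 0;
}
```

Two files opened by the same task would register the same cred once and store
the same objref, which is why the restored files end up sharing a cred again.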
* [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (12 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
` (2 subsequent siblings)
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Save/restore epoll items during checkpoint/restart respectively.
Output the epoll header and items separately. Chunk the output much
like the pid array gets chunked. This ensures that even sub-order 0
allocations will enable checkpoint of large epoll sets. A subsequent
patch will do something similar for the restore path.

On restart, we grab a piece of memory suitable to store a "chunk" of
items for input. Read the input one chunk at a time and add epoll
items for each item in the chunk.

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Changelog [v19]:
  - [Oren Laadan] Fix broken compilation for no-c/r architectures
Changelog [v19-rc1]:
  - [Oren Laadan] Return -EBUSY (not BUG_ON) if fd is gone on restart
  - [Oren Laadan] Fix the chunk size instead of auto-tune

Changelog v5:
	Fix potential recursion during collect.
	Replace call to ckpt_obj_collect() with ckpt_collect_file(). [Oren]
	Fix checkpoint leak detection when there are more items than
		expected.
	Cleanup/simplify error write paths. (will complicate in a later
		patch) [Oren]
	Remove files_deferq bits. [Oren]
	Remove extra newline. [Oren]
	Remove aggregate check on number of watches added. [Oren]
		This is OK since these will be done individually anyway.
	Remove check for negative objrefs during restart. [Oren]
	Fixup comment regarding race that indicates checkpoint leaks. [Oren]
	s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
		Patch for lots of epoll items follows.
	Moved sys_close(epfd) right under fget(). [Oren]
	Use CKPT_HDR_BUFFER rather than custom ckpt_read/write_*
		This makes it more similar to the pid array code. [Oren]
		It also simplifies the error recovery paths.
	Tested polling a pipe and 50,000 UNIX sockets.

Changelog v4: ckpt-v18
	Use files_deferq as submitted by Dan Smith
	Cleanup to only report >= 1 items when debugging.

Changelog v3: [unposted]
	Removed most of the TODOs -- the remainder will be removed by
		subsequent patches.
	Fixed missing ep_file_collect() [Serge]
	Rather than include checkpoint_hdr.h declare (but do not define)
		the two structs needed in eventpoll.h [Oren]
	Complain with ckpt_write_err() when we detect checkpoint obj
		leaks. [Oren]
	Remove redundant is_epoll_file() check in collect. [Oren]
	Move epfile_objref lookup to simplify error handling. [Oren]
	Simplify error handling with early return in
		ep_eventpoll_checkpoint(). [Oren]
	Cleaned up a comment. [Oren]
	Shorten CKPT_HDR_FILE_EPOLL_ITEMS (-FILE) [Oren]
		Renumbered to indicate that it follows the file table.
	Renamed the epoll struct in checkpoint_hdr.h [Oren]
		Also renamed substruct.
	Fixup return of empty ep_file_restore(). [Oren]
	Changed some error returns. [Oren]
	Changed some tests to BUG_ON(). [Oren]
	Factored out watch insert with epoll_ctl() into do_epoll_ctl().
		[Cedric, Oren]

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/files.c             |    7 +
 fs/eventpoll.c                 |  334 ++++++++++++++++++++++++++++++++++++----
 include/linux/checkpoint_hdr.h |   18 ++
 include/linux/eventpoll.h      |   17 ++-
 4 files changed, 347 insertions(+), 29 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index bcc1fbf..6aaaf22 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/eventpoll.h>
 
 #include <net/sock.h>
 
@@ -637,6 +638,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_TTY,
 		.restore = tty_file_restore,
 	},
+	/* epoll */
+	{
+		.file_name = "EPOLL",
+		.file_type = CKPT_FILE_EPOLL,
+		.restore = ep_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index bd056a5..7f1a091 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -39,6 +39,9 @@
 #include <asm/mman.h>
 #include <asm/atomic.h>
 
+#include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
+
 /*
  * LOCKING:
  * There are three level of locking required by epoll :
@@ -671,10 +674,20 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
 	return pollflags != -1 ? pollflags : 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define ep_eventpoll_checkpoint NULL
+#define ep_file_collect NULL
+#endif
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 	.release	= ep_eventpoll_release,
-	.poll		= ep_eventpoll_poll
+	.poll		= ep_eventpoll_poll,
+	.checkpoint	= ep_eventpoll_checkpoint,
+	.collect	= ep_file_collect,
 };
 
 /* Fast test to see if the file is an evenpoll file */
@@ -1226,35 +1239,18 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int do_epoll_ctl(int op, int fd,
+		 struct file *file, struct file *tfile,
+		 struct epoll_event *epds)
 {
 	int error;
-	struct file *file, *tfile;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
-
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
-	/* Get the "struct file *" for the eventpoll file */
-	error = -EBADF;
-	file = fget(epfd);
-	if (!file)
-		goto error_return;
-
-	/* Get the "struct file *" for the target file */
-	tfile = fget(fd);
-	if (!tfile)
-		goto error_fput;
 
 	/* The target file descriptor must support poll */
 	error = -EPERM;
 	if (!tfile->f_op || !tfile->f_op->poll)
-		goto error_tgt_fput;
+		return error;
 
 	/*
 	 * We have to check that the file structure underneath the file descriptor
@@ -1263,7 +1259,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	error = -EINVAL;
 	if (file == tfile || !is_file_epoll(file))
-		goto error_tgt_fput;
+		return error;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1284,8 +1280,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	switch (op) {
 	case EPOLL_CTL_ADD:
 		if (!epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_insert(ep, &epds, tfile, fd);
+			epds->events |= POLLERR | POLLHUP;
+			error = ep_insert(ep, epds, tfile, fd);
 		} else
 			error = -EEXIST;
 		break;
@@ -1297,15 +1293,46 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_modify(ep, epi, &epds);
+			epds->events |= POLLERR | POLLHUP;
+			error = ep_modify(ep, epi, epds);
 		} else
 			error = -ENOENT;
 		break;
 	}
 	mutex_unlock(&ep->mtx);
 
-error_tgt_fput:
+	return error;
+}
+
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	int error;
+	struct file *file, *tfile;
+	struct epoll_event epds;
+
+	error = -EFAULT;
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		goto error_return;
+
+	/* Get the "struct file *" for the eventpoll file */
+	error = -EBADF;
+	file = fget(epfd);
+	if (!file)
+		goto error_return;
+
+	/* Get the "struct file *" for the target file */
+	tfile = fget(fd);
+	if (!tfile)
+		goto error_fput;
+
+	error = do_epoll_ctl(op, fd, file, tfile, &epds);
 	fput(tfile);
 error_fput:
 	fput(file);
@@ -1413,6 +1440,257 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 
 #endif /* HAVE_SET_RESTORE_SIGMASK */
 
+#ifdef CONFIG_CHECKPOINT
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct rb_node *rbp;
+	struct eventpoll *ep;
+	int ret = 0;
+
+	ep = file->private_data;
+	mutex_lock(&ep->mtx);
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+		struct epitem *epi;
+
+		epi = rb_entry(rbp, struct epitem, rbn);
+		if (is_file_epoll(epi->ffd.file))
+			continue; /* Don't recurse */
+		ret = ckpt_collect_file(ctx, epi->ffd.file);
+		if (ret < 0)
+			break;
+	}
+	mutex_unlock(&ep->mtx);
+	return ret;
+}
+
+struct epoll_deferq_entry {
+	struct ckpt_ctx *ctx;
+	struct file *epfile;
+};
+
+#define CKPT_EPOLL_CHUNK  (8096 / (int) sizeof(struct ckpt_eventpoll_item))
+
+static int ep_items_checkpoint(void *data)
+{
+	struct epoll_deferq_entry *dq_entry = data;
+	struct ckpt_ctx *ctx;
+	struct ckpt_hdr_eventpoll_items *h;
+	struct ckpt_eventpoll_item *items;
+	struct rb_node *rbp;
+	struct eventpoll *ep;
+	__s32 epfile_objref;
+	int num_items = 0, ret;
+
+	ctx = dq_entry->ctx;
+
+	epfile_objref = ckpt_obj_lookup(ctx, dq_entry->epfile, CKPT_OBJ_FILE);
+	BUG_ON(epfile_objref <= 0);
+
+	ep = dq_entry->epfile->private_data;
+	mutex_lock(&ep->mtx);
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp))
+		num_items++;
+	mutex_unlock(&ep->mtx);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+	if (!h)
+		return -ENOMEM;
+	h->num_items = num_items;
+	h->epfile_objref = epfile_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret || !num_items)
+		return ret;
+
+	ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items,
+				  CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL);
+	if (!items)
+		return -ENOMEM;
+
+	/*
+	 * Walk the rbtree copying items into the chunk of memory and then
+	 * writing them to the checkpoint image
+	 */
+	ret = 0;
+	mutex_lock(&ep->mtx);
+	rbp = rb_first(&ep->rbr);
+	while ((num_items > 0) && rbp) {
+		int n = min(num_items, CKPT_EPOLL_CHUNK);
+		int j;
+
+		for (j = 0; rbp && j < n; j++, rbp = rb_next(rbp)) {
+			struct epitem *epi;
+			int objref;
+
+			epi = rb_entry(rbp, struct epitem, rbn);
+			items[j].fd = epi->ffd.fd;
+			items[j].events = epi->event.events;
+			items[j].data = epi->event.data;
+			objref = ckpt_obj_lookup(ctx, epi->ffd.file,
+						 CKPT_OBJ_FILE);
+			if (objref <= 0)
+				goto unlock;
+			items[j].file_objref = objref;
+		}
+		ret = ckpt_kwrite(ctx, items, n*sizeof(*items));
+		if (ret < 0)
+			break;
+		num_items -= n;
+	}
+unlock:
+	mutex_unlock(&ep->mtx);
+	kfree(items);
+	if (num_items != 0 || (num_items == 0 && rbp))
+		ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
+	if (ret)
+		ckpt_err(ctx, ret, "Checkpointing epoll items.\n");
+	return ret;
+}
+
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file *h;
+	struct epoll_deferq_entry dq_entry;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+	h->f_type = CKPT_FILE_EPOLL;
+	ret = checkpoint_file_common(ctx, file, h);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Defer saving the epoll items until all of the ffd.file pointers
+	 * have an objref; after the file table has been checkpointed.
+	 */
+	dq_entry.ctx = ctx;
+	dq_entry.epfile = file;
+	ret = deferqueue_add(ctx->files_deferq, &dq_entry,
+			     sizeof(dq_entry), ep_items_checkpoint, NULL);
+out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int ep_items_restore(void *data)
+{
+	struct ckpt_ctx *ctx = deferqueue_data_ptr(data);
+	struct ckpt_hdr_eventpoll_items *h;
+	struct ckpt_eventpoll_item *items = NULL;
+	struct eventpoll *ep;
+	struct file *epfile = NULL;
+	int ret, num_items;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	num_items = h->num_items;
+	epfile = ckpt_obj_fetch(ctx, h->epfile_objref, CKPT_OBJ_FILE);
+	ckpt_hdr_put(ctx, h);
+
+	/* Make sure userspace didn't give us a ref to a non-epoll file. */
+	if (IS_ERR(epfile))
+		return PTR_ERR(epfile);
+	if (!is_file_epoll(epfile))
+		return -EINVAL;
+	if (!num_items)
+		return 0;
+
+	ret = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+	/* Make sure the items match the size we expect */
+	if (num_items != (ret / sizeof(*items)))
+		return -EINVAL;
+
+	items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL);
+	if (!items)
+		return -ENOMEM;
+
+	ep = epfile->private_data;
+
+	while (num_items > 0) {
+		int n = min(num_items, CKPT_EPOLL_CHUNK);
+		int j;
+
+		ret = ckpt_kread(ctx, items, n*sizeof(*items));
+		if (ret < 0)
+			break;
+
+		/* Restore the epoll items/watches */
+		for (j = 0; !ret && j < n; j++) {
+			struct epoll_event epev;
+			struct file *tfile;
+
+			tfile = ckpt_obj_fetch(ctx, items[j].file_objref,
+					       CKPT_OBJ_FILE);
+			if (IS_ERR(tfile)) {
+				ret = PTR_ERR(tfile);
+				goto out;
+			}
+			epev.events = items[j].events;
+			epev.data = items[j].data;
+			ret = do_epoll_ctl(EPOLL_CTL_ADD, items[j].fd,
+					   epfile, tfile, &epev);
+		}
+		num_items -= n;
+	}
+out:
+	kfree(items);
+	return ret;
+}
+
+struct file *ep_file_restore(struct ckpt_ctx *ctx,
+			     struct ckpt_hdr_file *h)
+{
+	struct file *epfile;
+	int epfd, ret;
+
+	if (h->h.type != CKPT_HDR_FILE ||
+	    h->h.len != sizeof(*h) ||
+	    h->f_type != CKPT_FILE_EPOLL)
+		return ERR_PTR(-EINVAL);
+
+	epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
+	if (epfd < 0)
+		return ERR_PTR(epfd);
+	epfile = fget(epfd);
+	sys_close(epfd); /* harmless even if an error occurred */
+	if (!epfile)     /* can happen with a malicious user */
+		return ERR_PTR(-EBUSY);
+
+	/*
+	 * Needed before we can properly restore the watches and enforce the
+	 * limit on watch numbers.
+	 */
+	ret = restore_file_common(ctx, epfile, h);
+	if (ret < 0)
+		goto fput_out;
+
+	/*
+	 * Defer restoring the epoll items until the file table is
+	 * fully restored. Ensures that valid file objrefs will resolve.
+	 */
+	ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL);
+	if (ret < 0) {
+fput_out:
+		fput(epfile);
+		epfile = ERR_PTR(ret);
+	}
+	return epfile;
+}
+
+#endif /* CONFIG_CHECKPOINT */
+
 static int __init eventpoll_init(void)
 {
 	struct sysinfo si;
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4fe63b1..b96d2dc 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -119,6 +119,8 @@ enum {
 #define CKPT_HDR_TTY CKPT_HDR_TTY
 	CKPT_HDR_TTY_LDISC,
 #define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
+	CKPT_HDR_EPOLL_ITEMS,  /* must be after file-table */
+#define CKPT_HDR_EPOLL_ITEMS CKPT_HDR_EPOLL_ITEMS
 
 	CKPT_HDR_MM = 401,
 #define CKPT_HDR_MM CKPT_HDR_MM
@@ -477,6 +479,8 @@ enum file_type {
 #define CKPT_FILE_SOCKET CKPT_FILE_SOCKET
 	CKPT_FILE_TTY,
 #define CKPT_FILE_TTY CKPT_FILE_TTY
+	CKPT_FILE_EPOLL,
+#define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -693,6 +697,20 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_eventpoll_items {
+	struct ckpt_hdr h;
+	__s32  epfile_objref;
+	__u32  num_items;
+} __attribute__((aligned(8)));
+
+/* Contained in a CKPT_HDR_BUFFER following the ckpt_hdr_eventpoll_items */
+struct ckpt_eventpoll_item {
+	__u64 data;
+	__u32 fd;
+	__s32 file_objref;
+	__u32 events;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f6856a5..52282ae 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -56,6 +56,9 @@ struct file;
 
 
 #ifdef CONFIG_EPOLL
 
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
 /* Used to initialize the epoll bits inside the "struct file" */
 static inline void eventpoll_init_file(struct file *file)
@@ -95,11 +98,23 @@ static inline void eventpoll_release(struct file *file)
 	eventpoll_release_file(file);
 }
 
-#else
+#ifdef CONFIG_CHECKPOINT
+extern struct file *ep_file_restore(struct ckpt_ctx *ctx,
+				    struct ckpt_hdr_file *h);
+#endif
+#else
+/* !defined(CONFIG_EPOLL) */
 
 static inline void eventpoll_init_file(struct file *file) {}
 static inline void eventpoll_release(struct file *file) {}
 
+#ifdef CONFIG_CHECKPOINT
+static inline struct file *ep_file_restore(struct ckpt_ctx *ctx,
+					   struct ckpt_hdr_file *ptr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+#endif
 #endif
 
 #endif /* #ifdef __KERNEL__ */
-- 
1.6.3.3
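
The chunking described in the epoll patch (announce a count, then stream the
items through one fixed-size buffer and treat a count mismatch as a leak) can
be sketched in plain userspace C as follows. This is only an illustration
under stated assumptions: struct item, struct node, struct sink, sink_write()
and write_chunked() are hypothetical names, a linked list stands in for the
rbtree, and CHUNK is an arbitrary small number rather than the byte budget the
patch uses. Unlike the patch, the sketch writes only the items it actually
copied when the list turns out to be shorter than announced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4		/* items per chunk; arbitrary for the sketch */
#define SINK_MAX 64	/* capacity of the hypothetical output sink */

struct item { int fd; unsigned int events; };
struct node { struct item it; struct node *next; };

/* Hypothetical sink standing in for the checkpoint image. */
struct sink { struct item out[SINK_MAX]; int used; };

/* Append n items to the sink; -ENOSPC if they do not fit. */
static int sink_write(struct sink *s, const struct item *buf, int n)
{
	if (s->used + n > SINK_MAX)
		return -ENOSPC;
	memcpy(&s->out[s->used], buf, (size_t)n * sizeof(*buf));
	s->used += n;
	return 0;
}

/*
 * Write exactly 'expected' items from the list in CHUNK-sized pieces.
 * Returns -EBUSY if the list holds fewer or more items than announced,
 * mirroring the leak check in the patch.
 */
static int write_chunked(struct sink *s, const struct node *head, int expected)
{
	struct item buf[CHUNK];
	const struct node *p = head;
	int left = expected;

	assert(s && expected >= 0);
	while (left > 0 && p) {
		int n = left < CHUNK ? left : CHUNK;
		int j, ret;

		/* Fill the chunk buffer from the list. */
		for (j = 0; p && j < n; j++, p = p->next)
			buf[j] = p->it;
		ret = sink_write(s, buf, j);
		if (ret < 0)
			return ret;
		left -= j;
	}
	if (left != 0 || p)
		return -EBUSY;	/* count changed between the two passes */
	return 0;
}
```

The point of the fixed buffer is that memory use stays bounded by CHUNK no
matter how many items the set holds.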
* [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (13 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Save/restore eventfd files. These are anon_inodes just like epoll
but instead of a set of files to poll they are a 64-bit counter
and a flag value. Used for AIO.

[Oren Laadan] Added #ifdef's around checkpoint/restart to compile even
without CONFIG_CHECKPOINT

Changelog[v19]:
  - Fix broken compilation for architectures that don't support c/r

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/files.c             |    7 +++++
 fs/eventfd.c                   |   55 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    8 ++++++
 include/linux/eventfd.h        |   12 ++++++++
 4 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 6aaaf22..4b551fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -23,6 +23,7 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/eventpoll.h>
+#include <linux/eventfd.h>
 
 #include <net/sock.h>
 
@@ -644,6 +645,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_EPOLL,
 		.restore = ep_file_restore,
 	},
+	/* eventfd */
+	{
+		.file_name = "EVENTFD",
+		.file_type = CKPT_FILE_EVENTFD,
+		.restore = eventfd_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 7758cc3..f2785c0 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -18,6 +18,7 @@
 #include <linux/module.h>
 #include <linux/kref.h>
 #include <linux/eventfd.h>
+#include <linux/checkpoint.h>
 
 struct eventfd_ctx {
 	struct kref kref;
@@ -287,11 +288,65 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int eventfd_checkpoint(struct ckpt_ctx *ckpt_ctx, struct file *file)
+{
+	struct eventfd_ctx *ctx;
+	struct ckpt_hdr_file_eventfd *h;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ckpt_ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+	h->common.f_type = CKPT_FILE_EVENTFD;
+	ret = checkpoint_file_common(ckpt_ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ctx = file->private_data;
+	h->count = ctx->count;
+	h->flags = ctx->flags;
+	ret = ckpt_write_obj(ckpt_ctx, &h->common.h);
+out:
+	ckpt_hdr_put(ckpt_ctx, h);
+	return ret;
+}
+
+struct file *eventfd_restore(struct ckpt_ctx *ckpt_ctx,
+			     struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_eventfd *h = (struct ckpt_hdr_file_eventfd *) ptr;
+	struct file *evfile;
+	int evfd, ret;
+
+	/* Already know type == CKPT_HDR_FILE and f_type == CKPT_FILE_EVENTFD */
+	if (h->common.h.len != sizeof(*h))
+		return ERR_PTR(-EINVAL);
+
+	evfd = sys_eventfd2(h->count, h->flags);
+	if (evfd < 0)
+		return ERR_PTR(evfd);
+	evfile = fget(evfd);
+	sys_close(evfd);
+	if (!evfile)
+		return ERR_PTR(-EBUSY);
+
+	ret = restore_file_common(ckpt_ctx, evfile, &h->common);
+	if (ret < 0) {
+		fput(evfile);
+		return ERR_PTR(ret);
+	}
+	return evfile;
+}
+#else
+#define eventfd_checkpoint NULL
+#endif
+
 static const struct file_operations eventfd_fops = {
 	.release	= eventfd_release,
 	.poll		= eventfd_poll,
 	.read		= eventfd_read,
 	.write		= eventfd_write,
+	.checkpoint	= eventfd_checkpoint,
 };
 
 /**
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b96d2dc..0b36430 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -481,6 +481,8 @@ enum file_type {
 #define CKPT_FILE_TTY CKPT_FILE_TTY
 	CKPT_FILE_EPOLL,
 #define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
+	CKPT_FILE_EVENTFD,
+#define CKPT_FILE_EVENTFD CKPT_FILE_EVENTFD
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -505,6 +507,12 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_eventfd {
+	struct ckpt_hdr_file common;
+	__u64 count;
+	__u32 flags;
+} __attribute__((aligned(8)));
+
 /* socket */
 struct ckpt_hdr_socket {
 	struct ckpt_hdr h;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 91bb4f2..2ce8525 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -39,6 +39,16 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
 				  __u64 *cnt);
 
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
+struct file *eventfd_restore(struct ckpt_ctx *ckpt_ctx,
+			     struct ckpt_hdr_file *ptr);
+#else
+#define eventfd_restore NULL
+#endif
+
 #else /* CONFIG_EVENTFD */
 
 /*
@@ -77,6 +87,8 @@ static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
 	return -ENOSYS;
 }
 
+#define eventfd_restore NULL
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */
-- 
1.6.3.3
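
As a rough userspace analogue of what the eventfd patch saves (a 64-bit
counter plus flags), the sketch below captures an eventfd's counter and
recreates an equivalent descriptor. It is an illustration only, not the
kernel path: struct evfd_image and the evfd_* helpers are hypothetical, the
flags are passed in by the caller because userspace cannot read them back
from the descriptor here, and it assumes a non-semaphore eventfd opened with
EFD_NONBLOCK. One deliberate difference: eventfd() takes a 32-bit initial
value, so the sketch creates the descriptor at zero and then write()s the
full 64-bit count.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical image record, loosely modelled on ckpt_hdr_file_eventfd. */
struct evfd_image {
	uint64_t count;
	int flags;
};

/* Add v to the counter behind fd; 0 on success, negative errno on failure. */
static int evfd_put(int fd, uint64_t v)
{
	if (v == 0)
		return 0;
	if (write(fd, &v, sizeof(v)) != (ssize_t)sizeof(v))
		return -errno;
	return 0;
}

/*
 * Capture the counter. read() drains a non-semaphore eventfd to zero,
 * so the value is written back to leave the source unchanged.
 */
static int evfd_save(int fd, int flags, struct evfd_image *img)
{
	uint64_t v = 0;

	assert(flags & EFD_NONBLOCK);
	assert(!(flags & EFD_SEMAPHORE));
	if (read(fd, &v, sizeof(v)) != (ssize_t)sizeof(v)) {
		if (errno != EAGAIN)
			return -errno;
		v = 0;	/* EAGAIN means the counter was already zero */
	}
	img->count = v;
	img->flags = flags;
	return evfd_put(fd, v);
}

/* Recreate an eventfd from the image; returns the new fd or negative errno. */
static int evfd_restore(const struct evfd_image *img)
{
	int fd = eventfd(0, img->flags);
	int ret;

	if (fd < 0)
		return -errno;
	ret = evfd_put(fd, img->count);
	if (ret < 0) {
		close(fd);
		return ret;
	}
	return fd;
}
```

The save step is not atomic with respect to other writers, which is one
reason the real code reads the counter inside the kernel instead.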
* [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3)
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (14 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
@ 2010-03-19 1:00 ` Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 1:00 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

Checkpoint and restore task->fs. Tasks sharing task->fs will
share them again after restart.

Original patch by Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Changelog:
  Jan 25: [orenl] Addressed comments by .. myself:
    - add leak detection
    - change order of save/restore of chroot and cwd
    - save/restore fs only after file-table and mm
    - rename functions to adopt existing conventions
  Dec 28: [serge] Addressed comments by Oren (and Dave)
    - define and use {get,put}_fs_struct helpers
    - fix locking comment
    - define ckpt_read_fname() and use in checkpoint/files.c

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/files.c             |  203 +++++++++++++++++++++++++++++++++++++++-
 checkpoint/objhash.c           |   34 +++++++
 checkpoint/process.c           |   17 ++++
 fs/fs_struct.c                 |   21 ++++
 fs/open.c                      |   58 +++++++-----
 include/linux/checkpoint.h     |    8 ++-
 include/linux/checkpoint_hdr.h |   12 +++
 include/linux/fs.h             |    4 +
 include/linux/fs_struct.h      |    2 +
 9 files changed, 331 insertions(+), 28 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 4b551fe..7855bae 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -15,6 +15,9 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/fs_struct.h>
+#include <linux/fs.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
 #include <linux/pipe_fs_i.h>
@@ -374,6 +377,62 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 	return objref;
 }
 
+int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct fs_struct *fs;
+	int fs_objref;
+
+	task_lock(t);
+	fs = t->fs;
+	get_fs_struct(fs);
+	task_unlock(t);
+
+	fs_objref = checkpoint_obj(ctx, fs, CKPT_OBJ_FS);
+	put_fs_struct(fs);
+
+	return fs_objref;
+}
+
+/* called with fs refcount bumped so it won't disappear */
+static int do_checkpoint_fs(struct ckpt_ctx *ctx, struct fs_struct *fs)
+{
+	struct ckpt_hdr_fs *h;
+	struct fs_struct *fscopy;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FS);
+	if (!h)
+		return -ENOMEM;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret)
+		return ret;
+
+	fscopy = copy_fs_struct(fs);
+	if (!fscopy)
+		return -ENOMEM;
+
+	ret = checkpoint_fname(ctx, &fscopy->pwd, &ctx->root_fs_path);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "%(T)writing path of cwd");
+		goto out;
+	}
+	ret = checkpoint_fname(ctx, &fscopy->root, &ctx->root_fs_path);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "%(T)writing path of fs root");
+		goto out;
+	}
+	ret = 0;
+ out:
+	free_fs_struct(fscopy);
+	return ret;
+}
+
+int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_fs(ctx, (struct fs_struct *) ptr);
+}
+
 /***********************************************************************
  * Collect
  */
@@ -460,10 +519,41 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct fs_struct *fs;
+	int ret;
+
+	task_lock(t);
+	fs = t->fs;
+	get_fs_struct(fs);
+	task_unlock(t);
+
+	ret = ckpt_obj_collect(ctx, fs, CKPT_OBJ_FS);
+
+	put_fs_struct(fs);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
 
+static int ckpt_read_fname(struct ckpt_ctx *ctx, char **fname)
+{
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **) fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return len;
+
+	(*fname)[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("read filename '%s'\n", *fname);
+
+	return len;
+}
+
 /**
  * restore_open_fname - read a file name and open a file
  * @ctx: checkpoint context
@@ -479,11 +569,9 @@ struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
 	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
 		return ERR_PTR(-EINVAL);
 
-	len = ckpt_read_payload(ctx, (void **) &fname,
-				PATH_MAX, CKPT_HDR_FILE_NAME);
+	len = ckpt_read_fname(ctx, &fname);
 	if (len < 0)
 		return ERR_PTR(len);
-	fname[len - 1] = '\0';	/* always play if safe */
 	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
 
 	file = filp_open(fname, flags, 0);
@@ -819,3 +907,112 @@ int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
 
 	return 0;
 }
+
+/*
+ * Called by task restore code to set the restarted task's
+ * current->fs to an entry on the hash
+ */
+int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref)
+{
+	struct fs_struct *newfs, *oldfs;
+
+	newfs = ckpt_obj_fetch(ctx, fs_objref, CKPT_OBJ_FS);
+	if (IS_ERR(newfs))
+		return PTR_ERR(newfs);
+
+	task_lock(current);
+	get_fs_struct(newfs);
+	oldfs = current->fs;
+	current->fs = newfs;
+	task_unlock(current);
+	put_fs_struct(oldfs);
+
+	return 0;
+}
+
+static int restore_chroot(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name)
+{
+	struct nameidata nd;
+	int ret;
+
+	ckpt_debug("attempting chroot to %s\n", name);
+	ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Opening chroot dir %s", name);
+		return ret;
+	}
+	ret = do_chroot(fs, &nd.path);
+	path_put(&nd.path);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Setting chroot %s", name);
+		return ret;
+	}
+	return 0;
+}
+
+static int restore_cwd(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name)
+{
+	struct nameidata nd;
+	int ret;
+
+	ckpt_debug("attempting chdir to %s\n", name);
+	ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Opening cwd %s", name);
+		return ret;
+	}
+	ret = do_chdir(fs, &nd.path);
+	path_put(&nd.path);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Setting cwd %s", name);
+		return ret;
+	}
+	return 0;
+}
+
+/*
+ * Called by objhash when it runs into a CKPT_OBJ_FS entry. Creates
+ * an fs_struct with desired chroot/cwd and places it in the hash.
+ */
+static struct fs_struct *do_restore_fs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_fs *h;
+	struct fs_struct *fs;
+	char *path;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+	ckpt_hdr_put(ctx, h);
+
+	fs = copy_fs_struct(current->fs);
+	if (!fs)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_read_fname(ctx, &path);
+	if (ret < 0)
+		goto out;
+	ret = restore_cwd(ctx, fs, path);
+	kfree(path);
+	if (ret)
+		goto out;
+
+	ret = ckpt_read_fname(ctx, &path);
+	if (ret < 0)
+		goto out;
+	ret = restore_chroot(ctx, fs, path);
+	kfree(path);
+
+out:
+	if (ret) {
+		free_fs_struct(fs);
+		return ERR_PTR(ret);
+	}
+	return fs;
+}
+
+void *restore_fs(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_fs(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 84bceec..5c4749d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,7 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fs_struct.h>
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
@@ -126,6 +127,29 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_fs_grab(void *ptr)
+{
+	get_fs_struct((struct fs_struct *) ptr);
+	return 0;
+}
+
+static void obj_fs_drop(void *ptr, int lastref)
+{
+	put_fs_struct((struct fs_struct *) ptr);
+}
+
+static int obj_fs_users(void *ptr)
+{
+	/*
+	 * It's safe to not use fs->lock because the fs is referenced.
+	 * It's also sufficient for leak detection: with no leak the
+	 * count can't change; with a leak it will be too big already
+	 * (even if it's about to grow), and if it's about to shrink
+	 * then it's as if we sampled the count a bit earlier.
+	 */
+	return ((struct fs_struct *) ptr)->users;
+}
+
 static int obj_sighand_grab(void *ptr)
 {
 	atomic_inc(&((struct sighand_struct *) ptr)->count);
@@ -330,6 +354,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* fs object */
+	{
+		.obj_name = "FS",
+		.obj_type = CKPT_OBJ_FS,
+		.ref_drop = obj_fs_drop,
+		.ref_grab = obj_fs_grab,
+		.ref_users = obj_fs_users,
+		.checkpoint = checkpoint_fs,
+		.restore = restore_fs,
+	},
 	/* sighand object */
 	{
 		.obj_name = "SIGHAND",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index e0ef795..f917112 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -232,6 +232,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
 	int mm_objref;
+	int fs_objref;
 	int sighand_objref;
 	int signal_objref;
 	int first, ret;
@@ -272,6 +273,13 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return mm_objref;
 	}
 
+	/* note: this must come *after* file-table and mm */
+	fs_objref = checkpoint_obj_fs(ctx, t);
+	if (fs_objref < 0) {
+		ckpt_err(ctx, fs_objref, "%(T)process fs\n");
+		return fs_objref;
+	}
+
 	sighand_objref = checkpoint_obj_sighand(ctx, t);
 	ckpt_debug("sighand: objref %d\n", sighand_objref);
 	if (sighand_objref < 0) {
@@ -299,6 +307,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
+	h->fs_objref = fs_objref;
 	h->sighand_objref = sighand_objref;
 	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -477,6 +486,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_collect_mm(ctx, t);
 	if (ret < 0)
 		return ret;
+	ret = ckpt_collect_fs(ctx, t);
+	if (ret < 0)
+		return ret;
 	ret = ckpt_collect_sighand(ctx, t);
 
 	return ret;
@@ -645,6 +657,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	ret = restore_obj_fs(ctx, h->fs_objref);
+	ckpt_debug("fs: ret %d (%p)\n", ret, current->fs);
+	if (ret < 0)
+		return ret;
+
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
 	if (ret < 0)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index eee0590..2a4c6f5 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -6,6 +6,27 @@
 #include <linux/fs_struct.h>
 
 /*
+ * call with owning task locked
+ */
+void get_fs_struct(struct fs_struct *fs)
+{
+	write_lock(&fs->lock);
+	fs->users++;
+	write_unlock(&fs->lock);
+}
+
+void put_fs_struct(struct fs_struct *fs)
+{
+	int kill;
+
+	write_lock(&fs->lock);
+	kill = !--fs->users;
+	write_unlock(&fs->lock);
+	if (kill)
+		free_fs_struct(fs);
+}
+
+/*
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
  * It can block.
  */
diff --git a/fs/open.c b/fs/open.c
index 040cef7..62fc70c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -527,6 +527,18 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode)
 	return sys_faccessat(AT_FDCWD, filename, mode);
 }
 
+int do_chdir(struct fs_struct *fs, struct path *path)
+{
+	int error;
+
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	if (error)
+		return error;
+
+	set_fs_pwd(fs, path);
+	return 0;
+}
+
 SYSCALL_DEFINE1(chdir, const char __user *, filename)
 {
 	struct path path;
@@ -534,17 +546,10 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
 
 	error = user_path_dir(filename, &path);
 	if (error)
-		goto out;
-
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
-	if (error)
-		goto dput_and_out;
-
-	set_fs_pwd(current->fs, &path);
+		return error;
 
-dput_and_out:
+	error = do_chdir(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
@@ -574,31 +579,36 @@ out:
 	return error;
 }
 
-SYSCALL_DEFINE1(chroot, const char __user *, filename)
+int do_chroot(struct fs_struct *fs, struct path *path)
 {
-	struct path path;
 	int error;
 
-	error = user_path_dir(filename, &path);
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
 	if (error)
-		goto out;
+		return error;
+
+	if (!capable(CAP_SYS_CHROOT))
+		return -EPERM;
 
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	error = security_path_chroot(path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	error = -EPERM;
-	if (!capable(CAP_SYS_CHROOT))
-		goto dput_and_out;
-	error = security_path_chroot(&path);
+	set_fs_root(fs, path);
+	return 0;
+}
+
+SYSCALL_DEFINE1(chroot, const char __user *, filename)
+{
+	struct path path;
+	int error;
+
+	error = user_path_dir(filename, &path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	set_fs_root(current->fs, &path);
-	error = 0;
-dput_and_out:
+	error = do_chroot(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
diff --git a/include/linux/checkpoint.h
b/include/linux/checkpoint.h index ca91405..3e0937a 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -10,7 +10,7 @@ * distribution for more details. */ -#define CHECKPOINT_VERSION 3 +#define CHECKPOINT_VERSION 4 /* checkpoint user flags */ #define CHECKPOINT_SUBTREE 0x1 @@ -236,6 +236,12 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file, struct ckpt_hdr_file *h); +extern int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t); +extern int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t); +extern int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref); +extern int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr); +extern void *restore_fs(struct ckpt_ctx *ctx); + /* credentials */ extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr); extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 0b36430..4dc852d 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -131,6 +131,9 @@ enum { CKPT_HDR_MM_CONTEXT, #define CKPT_HDR_MM_CONTEXT CKPT_HDR_MM_CONTEXT + CKPT_HDR_FS = 451, /* must be after file-table, mm */ +#define CKPT_HDR_FS CKPT_HDR_FS + CKPT_HDR_IPC = 501, #define CKPT_HDR_IPC CKPT_HDR_IPC CKPT_HDR_IPC_SHM, @@ -201,6 +204,8 @@ enum obj_type { #define CKPT_OBJ_FILE CKPT_OBJ_FILE CKPT_OBJ_MM, #define CKPT_OBJ_MM CKPT_OBJ_MM + CKPT_OBJ_FS, +#define CKPT_OBJ_FS CKPT_OBJ_FS CKPT_OBJ_SIGHAND, #define CKPT_OBJ_SIGHAND CKPT_OBJ_SIGHAND CKPT_OBJ_SIGNAL, @@ -416,6 +421,7 @@ struct ckpt_hdr_task_objs { __s32 files_objref; __s32 mm_objref; + __s32 fs_objref; __s32 sighand_objref; __s32 signal_objref; } __attribute__((aligned(8))); @@ -453,6 +459,12 @@ enum restart_block_type { }; /* file system */ +struct ckpt_hdr_fs { + struct ckpt_hdr h; + /* char *fs_root */ + /* char *fs_pwd */ +} 
__attribute__((aligned(8))); + struct ckpt_hdr_file_table { struct ckpt_hdr h; __s32 fdt_nfds; diff --git a/include/linux/fs.h b/include/linux/fs.h index 7902a51..a1525aa 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1818,6 +1818,10 @@ extern void drop_collected_mounts(struct vfsmount *); extern int vfs_statfs(struct dentry *, struct kstatfs *); +struct fs_struct; +extern int do_chdir(struct fs_struct *fs, struct path *path); +extern int do_chroot(struct fs_struct *fs, struct path *path); + extern int current_umask(void); /* /sys/fs */ diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h index 78a05bf..a73cbcb 100644 --- a/include/linux/fs_struct.h +++ b/include/linux/fs_struct.h @@ -20,5 +20,7 @@ extern struct fs_struct *copy_fs_struct(struct fs_struct *); extern void free_fs_struct(struct fs_struct *); extern void daemonize_fs_struct(void); extern int unshare_fs_struct(void); +extern void get_fs_struct(struct fs_struct *); +extern void put_fs_struct(struct fs_struct *); #endif /* _LINUX_FS_STRUCT_H */ -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
` (15 preceding siblings ...)
2010-03-19 1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
@ 2010-03-19 1:00 ` Oren Laadan
16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 1:00 UTC (permalink / raw)
To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

We only allow c/r when all processes share a single mounts ns.

We do intend to implement c/r of mounts and mounts namespaces in the
kernel. It shouldn't be ugly or complicate locking to do so. We just
haven't gotten around to it. A more complete solution is more than we
want to take on now for v19.

But, as far as possible, we'd like everything that we don't support
to not be checkpointable, since failing to do so has in the past
invited slanderous accusations of being a toy implementation :)

Meanwhile, we get the following:
1) Checkpoint bails if not all tasks share the same mnt-ns
2) Leak detection works for full container checkpoint

On restart, all tasks inherit the same mnt-ns of the coordinator, by
default. A follow-up patch to user-cr will add a new switch to
'restart' to request the CLONE_NEWNS flag when creating the root-task
of the restart.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/objhash.c           |   25 +++++++++++++++++++++++++
 include/linux/checkpoint.h     |    2 +-
 include/linux/checkpoint_hdr.h |    4 ++++
 kernel/nsproxy.c               |   16 +++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 5c4749d..42998b2 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/mnt_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <net/sock.h>
@@ -214,6 +215,22 @@ static int obj_ipc_ns_users(void *ptr)
 	return atomic_read(&((struct ipc_namespace *) ptr)->count);
 }

+static int obj_mnt_ns_grab(void *ptr)
+{
+	get_mnt_ns((struct mnt_namespace *) ptr);
+	return 0;
+}
+
+static void obj_mnt_ns_drop(void *ptr, int lastref)
+{
+	put_mnt_ns((struct mnt_namespace *) ptr);
+}
+
+static int obj_mnt_ns_users(void *ptr)
+{
+	return atomic_read(&((struct mnt_namespace *) ptr)->count);
+}
+
 static int obj_cred_grab(void *ptr)
 {
 	get_cred((struct cred *) ptr);
@@ -411,6 +428,14 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ipc_ns,
 		.restore = restore_ipc_ns,
 	},
+	/* mnt_ns object */
+	{
+		.obj_name = "MOUNTS NS",
+		.obj_type = CKPT_OBJ_MNT_NS,
+		.ref_grab = obj_mnt_ns_grab,
+		.ref_drop = obj_mnt_ns_drop,
+		.ref_users = obj_mnt_ns_users,
+	},
 	/* user_ns object */
 	{
 		.obj_name = "USER_NS",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3e0937a..64b4b8a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,7 +10,7 @@
  * distribution for more details.
  */

-#define CHECKPOINT_VERSION 4
+#define CHECKPOINT_VERSION 5

 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE 0x1
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4dc852d..28dfc36 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,8 @@ enum {
 #define CKPT_HDR_UTS_NS CKPT_HDR_UTS_NS
 	CKPT_HDR_IPC_NS,
 #define CKPT_HDR_IPC_NS CKPT_HDR_IPC_NS
+	CKPT_HDR_MNT_NS,
+#define CKPT_HDR_MNT_NS CKPT_HDR_MNT_NS
 	CKPT_HDR_CAPABILITIES,
 #define CKPT_HDR_CAPABILITIES CKPT_HDR_CAPABILITIES
 	CKPT_HDR_USER_NS,
@@ -216,6 +218,8 @@ enum obj_type {
 #define CKPT_OBJ_UTS_NS CKPT_OBJ_UTS_NS
 	CKPT_OBJ_IPC_NS,
 #define CKPT_OBJ_IPC_NS CKPT_OBJ_IPC_NS
+	CKPT_OBJ_MNT_NS,
+#define CKPT_OBJ_MNT_NS CKPT_OBJ_MNT_NS
 	CKPT_OBJ_USER_NS,
 #define CKPT_OBJ_USER_NS CKPT_OBJ_USER_NS
 	CKPT_OBJ_CRED,
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 17b048e..0da0d83 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -255,10 +255,17 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * ipc_ns (shm) may keep references to files: if this is the
 	 * first time we see this ipc_ns (ret > 0), proceed inside.
 	 */
-	if (ret)
+	if (ret) {
 		ret = ckpt_collect_ipc_ns(ctx, nsproxy->ipc_ns);
+		if (ret < 0)
+			goto out;
+	}

-	/* TODO: collect other namespaces here */
+	ret = ckpt_obj_collect(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
+	if (ret < 0)
+		goto out;
+
+	ret = 0;
  out:
 	put_nsproxy(nsproxy);
 	return ret;
@@ -282,7 +289,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 		goto out;
 	h->ipc_objref = ret;

-	/* TODO: Write other namespaces here */
+	/* FIXME: for now, only marked visited to pacify leaks */
+	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
+	if (ret < 0)
+		goto out;

 	ret = ckpt_write_obj(ctx, &h->h);
  out:
--
1.6.3.3

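The leak detection mentioned for the mounts namespace can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct demo_obj`, `demo_collect()` and `demo_obj_leaks()` are hypothetical names, not part of the patch set, and the model ignores any extra reference that the object hash itself may hold. The idea is that an object whose user count exceeds the number of references reached from inside the container is also referenced from outside it.

```c
#include <assert.h>

/* Hypothetical model of a reference-counted kernel object. */
struct demo_obj {
	int users;	/* references held anywhere in the system */
	int collected;	/* references reached from the container's tasks */
};

/* Record one reference that was reached from inside the container. */
static void demo_collect(struct demo_obj *obj)
{
	obj->collected++;
}

/* Nonzero if some reference must come from outside the container. */
static int demo_obj_leaks(const struct demo_obj *obj)
{
	return obj->users > obj->collected;
}
```

With two users and only one collected reference the object counts as leaked; once the second reference is collected it no longer does.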
* [C/R v20][PATCH 51/96] c/r: support for open pipes
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (9 preceding siblings ...)
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
` (6 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read-end or the write-end.

To checkpoint a file descriptor that refers to a pipe (either end),
we first look up the inode in the hash table:

If not found, it is the first encounter of this pipe. Besides the
file descriptor, we also (a) save the pipe data, and (b) register the
pipe inode in the hash. If found, it is the second encounter of this
pipe, namely, as we hit the other end of the same pipe. In both cases
we write the pipe-objref of the inode.

To restore, create a new pipe and thus have two file pointers (the
read-end and the write-end). We only use one of them, depending on
which side was checkpointed first. We register the file pointer of
the other end in the hash table, with the pipe_objref given for this
pipe from the checkpoint, to be used later when the other end
arrives. At this point we also restore the contents of the pipe
buffers.

To save the pipe buffer, given a source pipe, use do_tee() to clone
its contents into a temporary 'struct pipe_inode_info', and then use
do_splice_from() to transfer it directly to the checkpoint image file.

To restore the pipe buffer, with a fresh newly allocated target pipe,
use do_splice_to() to splice the data directly between the checkpoint
image file and the pipe.

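The first-encounter / second-encounter rule described above can be sketched in user space. This is a minimal model under stated assumptions, not the kernel code: `struct demo_objhash`, `demo_lookup_add()` and `demo_checkpoint_pipe_end()` are hypothetical stand-ins for the object hash and for the checkpoint path, and a plain array replaces the real hash table.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_OBJS 16

/* Hypothetical stand-in for the object hash: key -> objref (1-based). */
struct demo_objhash {
	const void *keys[DEMO_MAX_OBJS];
	int count;
};

/*
 * Look up @key and add it if missing. Returns the objref (>= 1), or -1
 * if the table is full. *first is set to 1 only when the key was added.
 */
static int demo_lookup_add(struct demo_objhash *h, const void *key, int *first)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->keys[i] == key) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->count == DEMO_MAX_OBJS)
		return -1;
	h->keys[h->count++] = key;
	*first = 1;
	return h->count;
}

/*
 * Model of checkpointing one end of a pipe: the objref is always
 * recorded, but the buffer is dumped only on the first encounter.
 * Returns 1 if the buffer would be dumped now, 0 if it was already
 * dumped for the other end, and -1 on error.
 */
static int demo_checkpoint_pipe_end(struct demo_objhash *h,
				    const void *inode, int *objref)
{
	int first;

	*objref = demo_lookup_add(h, inode, &first);
	if (*objref < 0)
		return -1;
	return first;
}
```

Both ends of one pipe share the same key (the inode), so they get the same objref and the buffer is saved exactly once; a different pipe gets a different objref.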
Changelog[v19-rc1]:
  - Switch to ckpt_obj_try_fetch()
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Adjust format of pipe buffer to include the mandatory pre-header
Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    7 ++
 fs/pipe.c                      |  157 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    9 +++
 include/linux/pipe_fs_i.h      |    8 ++
 4 files changed, 181 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index b404c8f..1c294fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
@@ -592,6 +593,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_GENERIC,
 		.restore = generic_file_restore,
 	},
+	/* pipes */
+	{
+		.file_name = "PIPE",
+		.file_type = CKPT_FILE_PIPE,
+		.restore = pipe_file_restore,
+	},
 };

 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 37ba29f..747b2d7 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,11 +13,13 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/audit.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>

 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -828,6 +830,158 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return ret;
 }

+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret = -ENOMEM;
+
+	pipe = alloc_pipe_info(NULL);
+	if (!pipe)
+		return ret;
+
+	pipe->readers = 1;	/* bluff link_pipe() below */
+	len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+	if (len == -EAGAIN)
+		len = 0;
+	if (len < 0) {
+		ret = len;
+		goto out;
+	}
+
+	ret = ckpt_write_obj_type(ctx, NULL, len, CKPT_HDR_PIPE_BUF);
+	if (ret < 0)
+		goto out;
+
+	ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+	if (ret < 0)
+		goto out;
+	if (ret != len)
+		ret = -EPIPE;  /* can occur due to an error in target file */
+ out:
+	__free_pipe_info(pipe);
+	return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe *h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int objref, first, ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_PIPE;
+	h->pipe_objref = objref;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+
+	if (first)
+		ret = checkpoint_pipe(ctx, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_PIPE_BUF);
+	if (len < 0)
+		return len;
+
+	pipe = file->f_dentry->d_inode->i_pipe;
+	ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0);
+
+	if (ret >= 0 && ret != len)
+		ret = -EPIPE;  /* can occur due to an error in source file */
+
+	return ret;
+}
+
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int fds[2], which, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	/*
+	 * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), then this is
+	 * the first time we see this pipe so we need to restore the
+	 * contents. Otherwise, use the file pointer and skip forward.
+	 */
+	if (!IS_ERR(file)) {
+		get_file(file);
+	} else if (PTR_ERR(file) == -EINVAL) {
+		/* first encounter of this pipe: create it */
+		ret = do_pipe_flags(fds, 0);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+		/*
+		 * Below we return the file corresponding to one side
+		 * of the pipe for our caller to use. Now insert the
+		 * other side of the pipe to the hash, to be picked up
+		 * when that side is restored.
+		 */
+		file = fget(fds[1-which]);	/* the 'other' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		fput(file);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		file = fget(fds[which]);	/* 'this' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+
+		/* get rid of the file descriptors (caller sets that) */
+		sys_close(fds[which]);
+		sys_close(fds[1-which]);
+	} else {
+		return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+#else
+#define pipe_file_checkpoint NULL
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -844,6 +998,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open = pipe_read_open,
 	.release = pipe_read_release,
 	.fasync = pipe_read_fasync,
+	.checkpoint = pipe_file_checkpoint,
 };

 const struct file_operations write_pipefifo_fops = {
@@ -856,6 +1011,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open = pipe_write_open,
 	.release = pipe_write_release,
 	.fasync = pipe_write_fasync,
+	.checkpoint = pipe_file_checkpoint,
 };

 const struct file_operations rdwr_pipefifo_fops = {
@@ -869,6 +1025,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open = pipe_rdwr_open,
 	.release = pipe_rdwr_release,
 	.fasync = pipe_rdwr_fasync,
+	.checkpoint = pipe_file_checkpoint,
 };

 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 6fae6ef..885d06b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,8 @@ enum {
 #define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
 	CKPT_HDR_FILE,
 #define CKPT_HDR_FILE CKPT_HDR_FILE
+	CKPT_HDR_PIPE_BUF,
+#define CKPT_HDR_PIPE_BUF CKPT_HDR_PIPE_BUF

 	CKPT_HDR_MM = 401,
 #define CKPT_HDR_MM CKPT_HDR_MM
@@ -277,6 +279,8 @@ enum file_type {
 #define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
 	CKPT_FILE_GENERIC,
 #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_PIPE,
+#define CKPT_FILE_PIPE CKPT_FILE_PIPE
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -296,6 +300,11 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));

+struct ckpt_hdr_file_pipe {
+	struct ckpt_hdr_file common;
+	__s32 pipe_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b43a9e0..e526a12 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);

+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
+#endif
+
 #endif
--
1.6.3.3

* [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (10 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
` (5 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

FIFOs are almost like pipes.

Checkpoint adds the FIFO pathname. The first time the FIFO is found
it also assigns an @objref and dumps the contents in the buffers.

To restore, use the @objref only to determine whether a particular
FIFO has already been restored earlier. Note that it ignores the file
pointer that matches that @objref (unlike with pipes, where that file
corresponds to the other end of the pipe). Instead, it creates a new
FIFO using the saved pathname.

Changelog [v19-rc3]:
  - Rebase to kernel 2.6.33
Changelog [v19-rc1]:
  - Switch to ckpt_obj_try_fetch()
  - [Matt Helsley] Add cpp definitions for enums

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    6 +++
 fs/pipe.c                      |   81 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |    2 +
 include/linux/pipe_fs_i.h      |    2 +
 4 files changed, 90 insertions(+), 1 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1c294fe..c647bfd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -599,6 +599,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_PIPE,
 		.restore = pipe_file_restore,
 	},
+	/* fifo */
+	{
+		.file_name = "FIFO",
+		.file_type = CKPT_FILE_FIFO,
+		.restore = fifo_file_restore,
+	},
 };

 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 747b2d7..8c79493 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -830,6 +830,8 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return ret;
 }

+static struct vfsmount *pipe_mnt __read_mostly;
+
 #ifdef CONFIG_CHECKPOINT
 static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
 {
@@ -877,7 +879,11 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (!h)
 		return -ENOMEM;

-	h->common.f_type = CKPT_FILE_PIPE;
+	/* fifo and pipe are similar at checkpoint, differ on restore */
+	if (inode->i_sb == pipe_mnt->mnt_sb)
+		h->common.f_type = CKPT_FILE_PIPE;
+	else
+		h->common.f_type = CKPT_FILE_FIFO;
 	h->pipe_objref = objref;

 	ret = checkpoint_file_common(ctx, file, &h->common);
@@ -887,6 +893,13 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (ret < 0)
 		goto out;

+	/* FIFO also needs a file name */
+	if (h->common.f_type == CKPT_FILE_FIFO) {
+		ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+		if (ret < 0)
+			goto out;
+	}
+
 	if (first)
 		ret = checkpoint_pipe(ctx, inode);
  out:
@@ -978,8 +991,74 @@ struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)

 	return file;
 }
+
+struct file *fifo_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int first, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_FIFO)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), this is the
+	 * first time for this fifo.
+	 */
+	file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	if (!IS_ERR(file))
+		first = 0;
+	else if (PTR_ERR(file) == -EINVAL)
+		first = 1;
+	else
+		return file;
+
+	/*
+	 * To avoid blocking, always open the fifo with O_RDWR;
+	 * then fix flags below.
+	 */
+	file = restore_open_fname(ctx, (ptr->f_flags & ~O_ACCMODE) | O_RDWR);
+	if (IS_ERR(file))
+		return file;
+
+	if ((ptr->f_flags & O_ACCMODE) == O_RDONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_RDONLY;
+		file->f_mode &= ~FMODE_WRITE;
+	} else if ((ptr->f_flags & O_ACCMODE) == O_WRONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_WRONLY;
+		file->f_mode &= ~FMODE_READ;
+	} else if ((ptr->f_flags & O_ACCMODE) != O_RDWR) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* first time: add to objhash and restore fifo's contents */
+	if (first) {
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
 #else
 #define pipe_file_checkpoint NULL
+#define fifo_file_checkpoint NULL
 #endif /* CONFIG_CHECKPOINT */

 /*
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 885d06b..fce35f3 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -281,6 +281,8 @@ enum file_type {
 #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
 	CKPT_FILE_PIPE,
 #define CKPT_FILE_PIPE CKPT_FILE_PIPE
+	CKPT_FILE_FIFO,
+#define CKPT_FILE_FIFO CKPT_FILE_FIFO
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index e526a12..596403e 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -160,6 +160,8 @@ struct ckpt_ctx;
 struct ckpt_hdr_file;
 extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
 				      struct ckpt_hdr_file *ptr);
+extern struct file *fifo_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
 #endif

 #endif
--
1.6.3.3

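The flag handling in the FIFO restore path (open with O_RDWR so the open cannot block, then put the saved access mode back) can be sketched as two pure helpers. This is a user-space illustration under stated assumptions: `demo_fifo_open_flags()` and `demo_fifo_fix_flags()` are hypothetical names, the code assumes the Linux values of the O_* constants, and it does not touch f_mode as the kernel code does.

```c
#include <assert.h>
#include <fcntl.h>

/* Flags for the initial open: same as saved, but forced to O_RDWR. */
static int demo_fifo_open_flags(int saved_flags)
{
	return (saved_flags & ~O_ACCMODE) | O_RDWR;
}

/*
 * Flags after the open, with the saved access mode restored.
 * Returns -1 if the saved access mode is not a valid one.
 */
static int demo_fifo_fix_flags(int open_flags, int saved_flags)
{
	int acc = saved_flags & O_ACCMODE;

	if (acc != O_RDONLY && acc != O_WRONLY && acc != O_RDWR)
		return -1;
	return (open_flags & ~O_ACCMODE) | acc;
}
```

For a FIFO saved as `O_WRONLY | O_NONBLOCK`, the initial open uses `O_RDWR | O_NONBLOCK` and the fix-up step returns the original flags; an access-mode value outside the three valid ones is rejected.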
* [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (11 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
` (4 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger

From: Matt Helsley <matthltc@us.ibm.com>

We do not support restarting fsnotify watches. inotify and fanotify
utilize anon_inodes for pseudofiles which lack the .checkpoint
operation, so they already cleanly prevent checkpoint. dnotify, on
the other hand, registers its watches using fcntl(), which does not
require the userspace task to hold an fd with an empty .checkpoint
operation. This means userspace could use dnotify to set up fsnotify
watches which won't be re-created during restart.

Check for fsnotify watches created with dnotify and reject checkpoint
if there are any.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c          |    5 +++++
 fs/notify/dnotify/dnotify.c |   18 ++++++++++++++++++
 include/linux/dnotify.h     |    6 ++++++
 3 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index c647bfd..62feadd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -207,6 +207,11 @@ int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
 		return -EBADF;
 	}

+	if (is_dnotify_attached(file)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file);
+		return -EBADF;
+	}
+
 	ret = file->f_op->checkpoint(ctx, file);
 	if (ret < 0)
 		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 7e54e52..0a63bf6 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -289,6 +289,24 @@ static int attach_dn(struct dnotify_struct *dn, struct dnotify_mark_entry *dnent
 	return 0;
 }

+int is_dnotify_attached(struct file *filp)
+{
+	struct fsnotify_mark_entry *entry;
+	struct inode *inode;
+
+	inode = filp->f_path.dentry->d_inode;
+	if (!S_ISDIR(inode->i_mode))
+		return 0;
+
+	spin_lock(&inode->i_lock);
+	entry = fsnotify_find_mark_entry(dnotify_group, inode);
+	spin_unlock(&inode->i_lock);
+	if (!entry)
+		return 0;
+	fsnotify_put_mark(entry);
+	return 1;
+}
+
 /*
  * When a process calls fcntl to attach a dnotify watch to a directory it ends
  * up here.  Allocate both a mark for fsnotify to add and a dnotify_struct to be
diff --git a/include/linux/dnotify.h b/include/linux/dnotify.h
index ecc0628..b9ce13c 100644
--- a/include/linux/dnotify.h
+++ b/include/linux/dnotify.h
@@ -29,6 +29,7 @@ struct dnotify_struct {
 			    FS_MOVED_FROM | FS_MOVED_TO)

 extern void dnotify_flush(struct file *, fl_owner_t);
+extern int is_dnotify_attached(struct file *);
 extern int fcntl_dirnotify(int, struct file *, unsigned long);

 #else
@@ -37,6 +38,11 @@ static inline void dnotify_flush(struct file *filp, fl_owner_t id)
 {
 }

+static inline int is_dnotify_attached(struct file *filp)
+{
+	return 0;
+}
+
 static inline int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 {
 	return -EINVAL;
--
1.6.3.3

* [C/R v20][PATCH 66/96] c/r: restore file->f_cred
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (12 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
` (3 subsequent siblings)
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Serge E. Hallyn
From: Serge E. Hallyn <serue@us.ibm.com>

Restore a file's f_cred. This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.

Changelog[v1]:
  - [Nathan Lynch] discard const from struct cred * where appropriate

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |   18 ++++++++++++++++--
 include/linux/checkpoint_hdr.h |    2 +-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 62feadd..63a611f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -148,15 +148,21 @@ static int scan_fds(struct files_struct *files, int **fdtable)
 int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 			   struct ckpt_hdr_file *h)
 {
+	struct cred *f_cred = (struct cred *) file->f_cred;
+
 	h->f_flags = file->f_flags;
 	h->f_mode = file->f_mode;
 	h->f_pos = file->f_pos;
 	h->f_version = file->f_version;

+	h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
+	if (h->f_credref < 0)
+		return h->f_credref;
+
 	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
 		h->f_credref);

-	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+	/* FIX: need also file->f_owner, etc */

 	return 0;
 }
@@ -522,8 +528,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 	fmode_t new_mode = file->f_mode;
 	fmode_t saved_mode = (__force fmode_t) h->f_mode;
 	int ret;
+	struct cred *cred;
+
+	/* FIX: need to restore owner etc */

-	/* FIX: need to restore uid, gid, owner etc */
+	/* restore the cred */
+	cred = ckpt_obj_fetch(ctx, h->f_credref, CKPT_OBJ_CRED);
+	if (IS_ERR(cred))
+		return PTR_ERR(cred);
+	put_cred(file->f_cred);
+	file->f_cred = get_cred(cred);

 	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
 	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cbccc81..729be96 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -432,7 +432,7 @@ struct ckpt_hdr_file {
 	__u32 f_type;
 	__u32 f_mode;
 	__u32 f_flags;
-	__u32 _padding;
+	__s32 f_credref;
 	__u64 f_pos;
 	__u64 f_version;
 } __attribute__((aligned(8)));
-- 
1.6.3.3
^ permalink raw reply related [flat|nested] 88+ messages in thread
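The restore hunk in the patch above swaps one reference-counted credential for another on the file. A minimal userspace sketch of that swap pattern follows, assuming a plain integer refcount. `struct ref_obj`, `struct holder`, `obj_new()`, `obj_get()`, `obj_put()` and `holder_set()` are hypothetical names for illustration only and are not kernel APIs. The sketch takes the new reference before dropping the old one, which stays safe even when both pointers name the same object; the patch itself drops first, which is presumably fine there because the object hash is assumed to hold its own reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object, standing in for struct cred. */
struct ref_obj {
	int refs;
};

/* Hypothetical owner of one reference, standing in for struct file. */
struct holder {
	struct ref_obj *obj;
};

/* Allocate an object with one reference owned by the caller. */
struct ref_obj *obj_new(void)
{
	struct ref_obj *o = calloc(1, sizeof(*o));

	if (o)
		o->refs = 1;
	return o;
}

/* Take an extra reference and return the same pointer. */
struct ref_obj *obj_get(struct ref_obj *o)
{
	o->refs++;
	return o;
}

/* Drop one reference; free the object when the last one goes away. */
void obj_put(struct ref_obj *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		free(o);
}

/* Point the holder at newobj: take the new reference, then drop the old. */
void holder_set(struct holder *h, struct ref_obj *newobj)
{
	struct ref_obj *old = h->obj;

	h->obj = obj_get(newobj);
	if (old)
		obj_put(old);
}
```

Setting the holder to the object it already holds leaves the count unchanged, because the get happens before the put.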
* [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets 2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan ` (13 preceding siblings ...) 2010-03-19 0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan @ 2010-03-19 0:59 ` Oren Laadan 2010-03-19 0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan ` (2 subsequent siblings) 17 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger From: Matt Helsley <matthltc@us.ibm.com> Save/restore epoll items during checkpoint/restart respectively. Output the epoll header and items separately. Chunk the output much like the pid array gets chunked. This ensures that even sub-order 0 allocations will enable checkpoint of large epoll sets. A subsequent patch will do something similar for the restore path. On restart, we grab a piece of memory suitable to store a "chunk" of items for input. Read the input one chunk at a time and add epoll items for each item in the chunk. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge Hallyn <serue@us.ibm.com> Changelog [v19]: - [Oren Laadan] Fix broken compilation for no-c/r architectures Changelog [v19-rc1]: - [Oren Laadan] Return -EBUSY (not BUG_ON) if fd is gone on restart - [Oren Laadan] Fix the chunk size instead of auto-tune Changelog v5: Fix potential recursion during collect. Replace call to ckpt_obj_collect() with ckpt_collect_file(). [Oren] Fix checkpoint leak detection when there are more items than expected. Cleanup/simplify error write paths. (will complicate in a later patch) [Oren] Remove files_deferq bits. [Oren] Remove extra newline. [Oren] Remove aggregate check on number of watches added. [Oren] This is OK since these will be done individually anyway. Remove check for negative objrefs during restart. 
[Oren]
	Fixup comment regarding race that indicates checkpoint leaks. [Oren]
	s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
	Patch for lots of epoll items follows.
	Moved sys_close(epfd) right under fget(). [Oren]
	Use CKPT_HDR_BUFFER rather than custom ckpt_read/write_*
	This makes it more similar to the pid array code. [Oren]
	It also simplifies the error recovery paths.
	Tested polling a pipe and 50,000 UNIX sockets.
Changelog v4: ckpt-v18
	Use files_deferq as submitted by Dan Smith
	Cleanup to only report >= 1 items when debugging.
Changelog v3: [unposted]
	Removed most of the TODOs -- the remainder will be removed by subsequent patches.
	Fixed missing ep_file_collect() [Serge]
	Rather than include checkpoint_hdr.h declare (but do not define) the two structs needed in eventpoll.h [Oren]
	Complain with ckpt_write_err() when we detect checkpoint obj leaks. [Oren]
	Remove redundant is_epoll_file() check in collect. [Oren]
	Move epfile_objref lookup to simplify error handling. [Oren]
	Simplify error handling with early return in ep_eventpoll_checkpoint(). [Oren]
	Cleaned up a comment. [Oren]
	Shorten CKPT_HDR_FILE_EPOLL_ITEMS (-FILE) [Oren]
	Renumbered to indicate that it follows the file table.
	Renamed the epoll struct in checkpoint_hdr.h [Oren]
	Also renamed substruct.
	Fixup return of empty ep_file_restore(). [Oren]
	Changed some error returns. [Oren]
	Changed some tests to BUG_ON(). [Oren]
	Factored out watch insert with epoll_ctl() into do_epoll_ctl(). [Cedric, Oren]
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. 
Hallyn <serue@us.ibm.com> --- checkpoint/files.c | 7 + fs/eventpoll.c | 334 ++++++++++++++++++++++++++++++++++++---- include/linux/checkpoint_hdr.h | 18 ++ include/linux/eventpoll.h | 17 ++- 4 files changed, 347 insertions(+), 29 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index bcc1fbf..6aaaf22 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -22,6 +22,7 @@ #include <linux/deferqueue.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> +#include <linux/eventpoll.h> #include <net/sock.h> @@ -637,6 +638,12 @@ static struct restore_file_ops restore_file_ops[] = { .file_type = CKPT_FILE_TTY, .restore = tty_file_restore, }, + /* epoll */ + { + .file_name = "EPOLL", + .file_type = CKPT_FILE_EPOLL, + .restore = ep_file_restore, + }, }; static struct file *do_restore_file(struct ckpt_ctx *ctx) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index bd056a5..7f1a091 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -39,6 +39,9 @@ #include <asm/mman.h> #include <asm/atomic.h> +#include <linux/checkpoint.h> +#include <linux/deferqueue.h> + /* * LOCKING: * There are three level of locking required by epoll : @@ -671,10 +674,20 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) return pollflags != -1 ? 
pollflags : 0; } +#ifdef CONFIG_CHECKPOINT +static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file); +static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file); +#else +#define ep_eventpoll_checkpoint NULL +#define ep_file_collect NULL +#endif + /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { .release = ep_eventpoll_release, - .poll = ep_eventpoll_poll + .poll = ep_eventpoll_poll, + .checkpoint = ep_eventpoll_checkpoint, + .collect = ep_file_collect, }; /* Fast test to see if the file is an evenpoll file */ @@ -1226,35 +1239,18 @@ SYSCALL_DEFINE1(epoll_create, int, size) * the eventpoll file that enables the insertion/removal/change of * file descriptors inside the interest set. */ -SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, - struct epoll_event __user *, event) +int do_epoll_ctl(int op, int fd, + struct file *file, struct file *tfile, + struct epoll_event *epds) { int error; - struct file *file, *tfile; struct eventpoll *ep; struct epitem *epi; - struct epoll_event epds; - - error = -EFAULT; - if (ep_op_has_event(op) && - copy_from_user(&epds, event, sizeof(struct epoll_event))) - goto error_return; - - /* Get the "struct file *" for the eventpoll file */ - error = -EBADF; - file = fget(epfd); - if (!file) - goto error_return; - - /* Get the "struct file *" for the target file */ - tfile = fget(fd); - if (!tfile) - goto error_fput; /* The target file descriptor must support poll */ error = -EPERM; if (!tfile->f_op || !tfile->f_op->poll) - goto error_tgt_fput; + return error; /* * We have to check that the file structure underneath the file descriptor @@ -1263,7 +1259,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, */ error = -EINVAL; if (file == tfile || !is_file_epoll(file)) - goto error_tgt_fput; + return error; /* * At this point it is safe to assume that the "private_data" contains @@ -1284,8 +1280,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, 
epfd, int, op, int, fd, switch (op) { case EPOLL_CTL_ADD: if (!epi) { - epds.events |= POLLERR | POLLHUP; - error = ep_insert(ep, &epds, tfile, fd); + epds->events |= POLLERR | POLLHUP; + error = ep_insert(ep, epds, tfile, fd); } else error = -EEXIST; break; @@ -1297,15 +1293,46 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, break; case EPOLL_CTL_MOD: if (epi) { - epds.events |= POLLERR | POLLHUP; - error = ep_modify(ep, epi, &epds); + epds->events |= POLLERR | POLLHUP; + error = ep_modify(ep, epi, epds); } else error = -ENOENT; break; } mutex_unlock(&ep->mtx); -error_tgt_fput: + return error; +} + +/* + * The following function implements the controller interface for + * the eventpoll file that enables the insertion/removal/change of + * file descriptors inside the interest set. + */ +SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, + struct epoll_event __user *, event) +{ + int error; + struct file *file, *tfile; + struct epoll_event epds; + + error = -EFAULT; + if (ep_op_has_event(op) && + copy_from_user(&epds, event, sizeof(struct epoll_event))) + goto error_return; + + /* Get the "struct file *" for the eventpoll file */ + error = -EBADF; + file = fget(epfd); + if (!file) + goto error_return; + + /* Get the "struct file *" for the target file */ + tfile = fget(fd); + if (!tfile) + goto error_fput; + + error = do_epoll_ctl(op, fd, file, tfile, &epds); fput(tfile); error_fput: fput(file); @@ -1413,6 +1440,257 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, #endif /* HAVE_SET_RESTORE_SIGMASK */ +#ifdef CONFIG_CHECKPOINT +static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file) +{ + struct rb_node *rbp; + struct eventpoll *ep; + int ret = 0; + + ep = file->private_data; + mutex_lock(&ep->mtx); + for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) { + struct epitem *epi; + + epi = rb_entry(rbp, struct epitem, rbn); + if (is_file_epoll(epi->ffd.file)) + continue; /* Don't recurse */ + ret = 
ckpt_collect_file(ctx, epi->ffd.file); + if (ret < 0) + break; + } + mutex_unlock(&ep->mtx); + return ret; +} + +struct epoll_deferq_entry { + struct ckpt_ctx *ctx; + struct file *epfile; +}; + +#define CKPT_EPOLL_CHUNK (8096 / (int) sizeof(struct ckpt_eventpoll_item)) + +static int ep_items_checkpoint(void *data) +{ + struct epoll_deferq_entry *dq_entry = data; + struct ckpt_ctx *ctx; + struct ckpt_hdr_eventpoll_items *h; + struct ckpt_eventpoll_item *items; + struct rb_node *rbp; + struct eventpoll *ep; + __s32 epfile_objref; + int num_items = 0, ret; + + ctx = dq_entry->ctx; + + epfile_objref = ckpt_obj_lookup(ctx, dq_entry->epfile, CKPT_OBJ_FILE); + BUG_ON(epfile_objref <= 0); + + ep = dq_entry->epfile->private_data; + mutex_lock(&ep->mtx); + for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) + num_items++; + mutex_unlock(&ep->mtx); + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS); + if (!h) + return -ENOMEM; + h->num_items = num_items; + h->epfile_objref = epfile_objref; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret || !num_items) + return ret; + + ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items, + CKPT_HDR_BUFFER); + if (ret < 0) + return ret; + + items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL); + if (!items) + return -ENOMEM; + + /* + * Walk the rbtree copying items into the chunk of memory and then + * writing them to the checkpoint image + */ + ret = 0; + mutex_lock(&ep->mtx); + rbp = rb_first(&ep->rbr); + while ((num_items > 0) && rbp) { + int n = min(num_items, CKPT_EPOLL_CHUNK); + int j; + + for (j = 0; rbp && j < n; j++, rbp = rb_next(rbp)) { + struct epitem *epi; + int objref; + + epi = rb_entry(rbp, struct epitem, rbn); + items[j].fd = epi->ffd.fd; + items[j].events = epi->event.events; + items[j].data = epi->event.data; + objref = ckpt_obj_lookup(ctx, epi->ffd.file, + CKPT_OBJ_FILE); + if (objref <= 0) + goto unlock; + items[j].file_objref = objref; + } + ret = 
ckpt_kwrite(ctx, items, n*sizeof(*items)); + if (ret < 0) + break; + num_items -= n; + } +unlock: + mutex_unlock(&ep->mtx); + kfree(items); + if (num_items != 0 || (num_items == 0 && rbp)) + ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */ + if (ret) + ckpt_err(ctx, ret, "Checkpointing epoll items.\n"); + return ret; +} + +static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file) +{ + struct ckpt_hdr_file *h; + struct epoll_deferq_entry dq_entry; + int ret = -ENOMEM; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE); + if (!h) + return -ENOMEM; + h->f_type = CKPT_FILE_EPOLL; + ret = checkpoint_file_common(ctx, file, h); + if (ret < 0) + goto out; + ret = ckpt_write_obj(ctx, &h->h); + if (ret < 0) + goto out; + + /* + * Defer saving the epoll items until all of the ffd.file pointers + * have an objref; after the file table has been checkpointed. + */ + dq_entry.ctx = ctx; + dq_entry.epfile = file; + ret = deferqueue_add(ctx->files_deferq, &dq_entry, + sizeof(dq_entry), ep_items_checkpoint, NULL); +out: + ckpt_hdr_put(ctx, h); + return ret; +} + +static int ep_items_restore(void *data) +{ + struct ckpt_ctx *ctx = deferqueue_data_ptr(data); + struct ckpt_hdr_eventpoll_items *h; + struct ckpt_eventpoll_item *items = NULL; + struct eventpoll *ep; + struct file *epfile = NULL; + int ret, num_items; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS); + if (IS_ERR(h)) + return PTR_ERR(h); + num_items = h->num_items; + epfile = ckpt_obj_fetch(ctx, h->epfile_objref, CKPT_OBJ_FILE); + ckpt_hdr_put(ctx, h); + + /* Make sure userspace didn't give us a ref to a non-epoll file. 
*/ + if (IS_ERR(epfile)) + return PTR_ERR(epfile); + if (!is_file_epoll(epfile)) + return -EINVAL; + if (!num_items) + return 0; + + ret = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER); + if (ret < 0) + return ret; + /* Make sure the items match the size we expect */ + if (num_items != (ret / sizeof(*items))) + return -EINVAL; + + items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL); + if (!items) + return -ENOMEM; + + ep = epfile->private_data; + + while (num_items > 0) { + int n = min(num_items, CKPT_EPOLL_CHUNK); + int j; + + ret = ckpt_kread(ctx, items, n*sizeof(*items)); + if (ret < 0) + break; + + /* Restore the epoll items/watches */ + for (j = 0; !ret && j < n; j++) { + struct epoll_event epev; + struct file *tfile; + + tfile = ckpt_obj_fetch(ctx, items[j].file_objref, + CKPT_OBJ_FILE); + if (IS_ERR(tfile)) { + ret = PTR_ERR(tfile); + goto out; + } + epev.events = items[j].events; + epev.data = items[j].data; + ret = do_epoll_ctl(EPOLL_CTL_ADD, items[j].fd, + epfile, tfile, &epev); + } + num_items -= n; + } +out: + kfree(items); + return ret; +} + +struct file *ep_file_restore(struct ckpt_ctx *ctx, + struct ckpt_hdr_file *h) +{ + struct file *epfile; + int epfd, ret; + + if (h->h.type != CKPT_HDR_FILE || + h->h.len != sizeof(*h) || + h->f_type != CKPT_FILE_EPOLL) + return ERR_PTR(-EINVAL); + + epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC); + if (epfd < 0) + return ERR_PTR(epfd); + epfile = fget(epfd); + sys_close(epfd); /* harmless even if an error occurred */ + if (!epfile) /* can happen with a malicious user */ + return ERR_PTR(-EBUSY); + + /* + * Needed before we can properly restore the watches and enforce the + * limit on watch numbers. + */ + ret = restore_file_common(ctx, epfile, h); + if (ret < 0) + goto fput_out; + + /* + * Defer restoring the epoll items until the file table is + * fully restored. Ensures that valid file objrefs will resolve. 
+ */ + ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL); + if (ret < 0) { +fput_out: + fput(epfile); + epfile = ERR_PTR(ret); + } + return epfile; +} + +#endif /* CONFIG_CHECKPOINT */ + static int __init eventpoll_init(void) { struct sysinfo si; diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 4fe63b1..b96d2dc 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -119,6 +119,8 @@ enum { #define CKPT_HDR_TTY CKPT_HDR_TTY CKPT_HDR_TTY_LDISC, #define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC + CKPT_HDR_EPOLL_ITEMS, /* must be after file-table */ +#define CKPT_HDR_EPOLL_ITEMS CKPT_HDR_EPOLL_ITEMS CKPT_HDR_MM = 401, #define CKPT_HDR_MM CKPT_HDR_MM @@ -477,6 +479,8 @@ enum file_type { #define CKPT_FILE_SOCKET CKPT_FILE_SOCKET CKPT_FILE_TTY, #define CKPT_FILE_TTY CKPT_FILE_TTY + CKPT_FILE_EPOLL, +#define CKPT_FILE_EPOLL CKPT_FILE_EPOLL CKPT_FILE_MAX #define CKPT_FILE_MAX CKPT_FILE_MAX }; @@ -693,6 +697,20 @@ struct ckpt_hdr_file_socket { __s32 sock_objref; } __attribute__((aligned(8))); +struct ckpt_hdr_eventpoll_items { + struct ckpt_hdr h; + __s32 epfile_objref; + __u32 num_items; +} __attribute__((aligned(8))); + +/* Contained in a CKPT_HDR_BUFFER following the ckpt_hdr_eventpoll_items */ +struct ckpt_eventpoll_item { + __u64 data; + __u32 fd; + __s32 file_objref; + __u32 events; +} __attribute__((aligned(8))); + /* memory layout */ struct ckpt_hdr_mm { struct ckpt_hdr h; diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h index f6856a5..52282ae 100644 --- a/include/linux/eventpoll.h +++ b/include/linux/eventpoll.h @@ -56,6 +56,9 @@ struct file; #ifdef CONFIG_EPOLL +struct ckpt_ctx; +struct ckpt_hdr_file; + /* Used to initialize the epoll bits inside the "struct file" */ static inline void eventpoll_init_file(struct file *file) @@ -95,11 +98,23 @@ static inline void eventpoll_release(struct file *file) eventpoll_release_file(file); } -#else +#ifdef CONFIG_CHECKPOINT 
+extern struct file *ep_file_restore(struct ckpt_ctx *ctx, + struct ckpt_hdr_file *h); +#endif +#else +/* !defined(CONFIG_EPOLL) */ static inline void eventpoll_init_file(struct file *file) {} static inline void eventpoll_release(struct file *file) {} +#ifdef CONFIG_CHECKPOINT +static inline struct file *ep_file_restore(struct ckpt_ctx *ctx, + struct ckpt_hdr_file *ptr) +{ + return ERR_PTR(-ENOSYS); +} +#endif #endif #endif /* #ifdef __KERNEL__ */ -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
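For readers skimming the epoll patch above: the checkpoint path counts the items, writes a header, and then streams the items through one fixed-size buffer instead of allocating room for the whole set. The userspace sketch below shows only that chunking pattern. `struct item`, `struct sink`, `sink_write()`, `write_chunked()` and `CHUNK_ITEMS` are hypothetical names made up for illustration; none of them is part of the patch or of any kernel API, and the sink merely counts what it is given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record, loosely modeled on the per-item struct in the patch. */
struct item {
	uint64_t data;
	uint32_t fd;
	int32_t file_objref;
	uint32_t events;
};

/* Tiny on purpose; the patch derives its chunk size from a byte budget. */
#define CHUNK_ITEMS 4

/* Hypothetical sink standing in for the checkpoint image writer. */
struct sink {
	size_t calls;	/* number of chunk writes seen */
	size_t items;	/* total number of items seen */
};

/* Record one chunk; a real writer would copy buf to the image here. */
int sink_write(struct sink *s, const struct item *buf, size_t n)
{
	(void) buf;
	s->calls++;
	s->items += n;
	return 0;
}

/* Stream `total` items through a single CHUNK_ITEMS-sized buffer. */
int write_chunked(struct sink *s, const struct item *src, size_t total)
{
	struct item *buf;
	size_t done = 0;
	int ret = 0;

	assert(src != NULL || total == 0);

	buf = calloc(CHUNK_ITEMS, sizeof(*buf));
	if (!buf)
		return -1;

	while (done < total) {
		size_t n = total - done;

		if (n > CHUNK_ITEMS)
			n = CHUNK_ITEMS;
		memcpy(buf, src + done, n * sizeof(*buf));
		ret = sink_write(s, buf, n);
		if (ret < 0)
			break;	/* stop at the first failed chunk */
		done += n;
	}
	free(buf);
	return ret;
}
```

With 10 items and a chunk of 4, the sink sees three writes (4, 4 and 2 items), and the buffer never grows with the size of the set.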
* [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (14 preceding siblings ...)
2010-03-19 0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
@ 2010-03-19 0:59 ` Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 0:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger
From: Matt Helsley <matthltc@us.ibm.com>

Save/restore eventfd files. These are anon_inodes just like epoll
but instead of a set of files to poll they are a 64-bit counter
and a flag value. Used for AIO.

[Oren Laadan] Added #ifdef's around checkpoint/restart to compile even
without CONFIG_CHECKPOINT

Changelog[v19]:
  - Fix broken compilation for architectures that don't support c/r

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    7 +++++
 fs/eventfd.c                   |   55 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    8 ++++++
 include/linux/eventfd.h        |   12 ++++++++
 4 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 6aaaf22..4b551fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -23,6 +23,7 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/eventpoll.h>
+#include <linux/eventfd.h>
 #include <net/sock.h>


@@ -644,6 +645,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_EPOLL,
 		.restore = ep_file_restore,
 	},
+	/* eventfd */
+	{
+		.file_name = "EVENTFD",
+		.file_type = CKPT_FILE_EVENTFD,
+		.restore = eventfd_restore,
+	},
 };

 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 7758cc3..f2785c0 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -18,6 +18,7 @@
 #include <linux/module.h>
 #include <linux/kref.h>
 #include <linux/eventfd.h>
+#include <linux/checkpoint.h>

 struct eventfd_ctx {
 	struct kref kref;
@@ -287,11 +288,65 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }

+#ifdef CONFIG_CHECKPOINT
+static int eventfd_checkpoint(struct ckpt_ctx *ckpt_ctx, struct file *file)
+{
+	struct eventfd_ctx *ctx;
+	struct ckpt_hdr_file_eventfd *h;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ckpt_ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+	h->common.f_type = CKPT_FILE_EVENTFD;
+	ret = checkpoint_file_common(ckpt_ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ctx = file->private_data;
+	h->count = ctx->count;
+	h->flags = ctx->flags;
+	ret = ckpt_write_obj(ckpt_ctx, &h->common.h);
+out:
+	ckpt_hdr_put(ckpt_ctx, h);
+	return ret;
+}
+
+struct file *eventfd_restore(struct ckpt_ctx *ckpt_ctx,
+			     struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_eventfd *h = (struct ckpt_hdr_file_eventfd *) ptr;
+	struct file *evfile;
+	int evfd, ret;
+
+	/* Already know type == CKPT_HDR_FILE and f_type == CKPT_FILE_EVENTFD */
+	if (h->common.h.len != sizeof(*h))
+		return ERR_PTR(-EINVAL);
+
+	evfd = sys_eventfd2(h->count, h->flags);
+	if (evfd < 0)
+		return ERR_PTR(evfd);
+	evfile = fget(evfd);
+	sys_close(evfd);
+	if (!evfile)
+		return ERR_PTR(-EBUSY);
+
+	ret = restore_file_common(ckpt_ctx, evfile, &h->common);
+	if (ret < 0) {
+		fput(evfile);
+		return ERR_PTR(ret);
+	}
+	return evfile;
+}
+#else
+#define eventfd_checkpoint NULL
+#endif
+
 static const struct file_operations eventfd_fops = {
 	.release	= eventfd_release,
 	.poll		= eventfd_poll,
 	.read		= eventfd_read,
 	.write		= eventfd_write,
+	.checkpoint	= eventfd_checkpoint,
 };

 /**
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b96d2dc..0b36430 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -481,6 +481,8 @@ enum file_type {
 #define CKPT_FILE_TTY CKPT_FILE_TTY
 	CKPT_FILE_EPOLL,
 #define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
+	CKPT_FILE_EVENTFD,
+#define CKPT_FILE_EVENTFD CKPT_FILE_EVENTFD
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -505,6 +507,12 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));

+struct ckpt_hdr_file_eventfd {
+	struct ckpt_hdr_file common;
+	__u64 count;
+	__u32 flags;
+} __attribute__((aligned(8)));
+
 /* socket */
 struct ckpt_hdr_socket {
 	struct ckpt_hdr h;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 91bb4f2..2ce8525 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -39,6 +39,16 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
 				  __u64 *cnt);

+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
+struct file *eventfd_restore(struct ckpt_ctx *ckpt_ctx,
+			     struct ckpt_hdr_file *ptr);
+#else
+#define eventfd_restore NULL
+#endif
+
 #else /* CONFIG_EVENTFD */

 /*
@@ -77,6 +87,8 @@ static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
 	return -ENOSYS;
 }

+#define eventfd_restore NULL
+
 #endif

 #endif /* _LINUX_EVENTFD_H */
-- 
1.6.3.3
^ permalink raw reply related [flat|nested] 88+ messages in thread
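As the eventfd patch above describes it, the state worth saving is a 64-bit counter plus a flags value. A rough userspace sketch of recreating a descriptor from such a pair follows, assuming Linux and glibc's `eventfd()` wrapper. `struct saved_eventfd`, `restore_eventfd()` and `drain_eventfd()` are hypothetical helpers for illustration, not anything from the patch. Note that `eventfd(2)` takes its initial value as an `unsigned int`, so the sketch creates the descriptor with a zero count and then `write()`s the saved 64-bit value, which is added to the counter.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical saved state: the counter and the creation flags. */
struct saved_eventfd {
	uint64_t count;
	int flags;
};

/* Recreate an eventfd from saved state; returns the new fd or -1. */
int restore_eventfd(const struct saved_eventfd *s)
{
	int fd;

	assert(s != NULL);

	fd = eventfd(0, s->flags);
	if (fd < 0)
		return -1;

	/* A write adds its 8-byte value to the counter, so start from 0. */
	if (s->count != 0 &&
	    write(fd, &s->count, sizeof(s->count)) != (ssize_t) sizeof(s->count)) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Read the counter (which also resets it); returns 0 on success. */
int drain_eventfd(int fd, uint64_t *out)
{
	return read(fd, out, sizeof(*out)) == (ssize_t) sizeof(*out) ? 0 : -1;
}
```

Writing the count after creation keeps values above `UINT_MAX` intact, which a single creation call with an `unsigned int` argument cannot express.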
* [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) 2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan ` (15 preceding siblings ...) 2010-03-19 0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan @ 2010-03-19 1:00 ` Oren Laadan 2010-03-19 1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan 17 siblings, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-19 1:00 UTC (permalink / raw) To: linux-fsdevel Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan, Serge Hallyn Checkpoint and restore task->fs. Tasks sharing task->fs will share them again after restart. Original patch by Serge Hallyn <serue@us.ibm.com> Changelog: Jan 25: [orenl] Addressed comments by .. myself: - add leak detection - change order of save/restore of chroot and cwd - save/restore fs only after file-table and mm - rename functions to adapt existing conventions Dec 28: [serge] Addressed comments by Oren (and Dave) - define and use {get,put}_fs_struct helpers - fix locking comment - define ckpt_read_fname() and use in checkpoint/files.c Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Signed-off-by: Serge Hallyn <serue@us.ibm.com> --- checkpoint/files.c | 203 +++++++++++++++++++++++++++++++++++++++- checkpoint/objhash.c | 34 +++++++ checkpoint/process.c | 17 ++++ fs/fs_struct.c | 21 ++++ fs/open.c | 58 +++++++----- include/linux/checkpoint.h | 8 ++- include/linux/checkpoint_hdr.h | 12 +++ include/linux/fs.h | 4 + include/linux/fs_struct.h | 2 + 9 files changed, 331 insertions(+), 28 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index 4b551fe..7855bae 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -15,6 +15,9 @@ #include <linux/module.h> #include <linux/sched.h> #include <linux/file.h> +#include <linux/namei.h> +#include <linux/fs_struct.h> +#include <linux/fs.h> #include <linux/fdtable.h> #include <linux/fsnotify.h> #include <linux/pipe_fs_i.h> @@ -374,6 
+377,62 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t) return objref; } +int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct fs_struct *fs; + int fs_objref; + + task_lock(t); + fs = t->fs; + get_fs_struct(fs); + task_unlock(t); + + fs_objref = checkpoint_obj(ctx, fs, CKPT_OBJ_FS); + put_fs_struct(fs); + + return fs_objref; +} + +/* called with fs refcount bumped so it won't disappear */ +static int do_checkpoint_fs(struct ckpt_ctx *ctx, struct fs_struct *fs) +{ + struct ckpt_hdr_fs *h; + struct fs_struct *fscopy; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FS); + if (!h) + return -ENOMEM; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret) + return ret; + + fscopy = copy_fs_struct(fs); + if (!fscopy) + return -ENOMEM; + + ret = checkpoint_fname(ctx, &fscopy->pwd, &ctx->root_fs_path); + if (ret < 0) { + ckpt_err(ctx, ret, "%(T)writing path of cwd"); + goto out; + } + ret = checkpoint_fname(ctx, &fscopy->root, &ctx->root_fs_path); + if (ret < 0) { + ckpt_err(ctx, ret, "%(T)writing path of fs root"); + goto out; + } + ret = 0; + out: + free_fs_struct(fscopy); + return ret; +} + +int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr) +{ + return do_checkpoint_fs(ctx, (struct fs_struct *) ptr); +} + /*********************************************************************** * Collect */ @@ -460,10 +519,41 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } +int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct fs_struct *fs; + int ret; + + task_lock(t); + fs = t->fs; + get_fs_struct(fs); + task_unlock(t); + + ret = ckpt_obj_collect(ctx, fs, CKPT_OBJ_FS); + + put_fs_struct(fs); + return ret; +} + /************************************************************************** * Restart */ +static int ckpt_read_fname(struct ckpt_ctx *ctx, char **fname) +{ + int len; + + len = ckpt_read_payload(ctx, (void **) fname, + 
PATH_MAX, CKPT_HDR_FILE_NAME); + if (len < 0) + return len; + + (*fname)[len - 1] = '\0'; /* always play it safe */ + ckpt_debug("read filename '%s'\n", *fname); + + return len; +} + /** * restore_open_fname - read a file name and open a file * @ctx: checkpoint context @@ -479,11 +569,9 @@ struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags) if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC)) return ERR_PTR(-EINVAL); - len = ckpt_read_payload(ctx, (void **) &fname, - PATH_MAX, CKPT_HDR_FILE_NAME); + len = ckpt_read_fname(ctx, &fname); if (len < 0) return ERR_PTR(len); - fname[len - 1] = '\0'; /* always play if safe */ ckpt_debug("fname '%s' flags %#x\n", fname, flags); file = filp_open(fname, flags, 0); @@ -819,3 +907,112 @@ int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref) return 0; } + +/* + * Called by task restore code to set the restarted task's + * current->fs to an entry on the hash + */ +int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref) +{ + struct fs_struct *newfs, *oldfs; + + newfs = ckpt_obj_fetch(ctx, fs_objref, CKPT_OBJ_FS); + if (IS_ERR(newfs)) + return PTR_ERR(newfs); + + task_lock(current); + get_fs_struct(newfs); + oldfs = current->fs; + current->fs = newfs; + task_unlock(current); + put_fs_struct(oldfs); + + return 0; +} + +static int restore_chroot(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name) +{ + struct nameidata nd; + int ret; + + ckpt_debug("attempting chroot to %s\n", name); + ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd); + if (ret) { + ckpt_err(ctx, ret, "%(T)Opening chroot dir %s", name); + return ret; + } + ret = do_chroot(fs, &nd.path); + path_put(&nd.path); + if (ret) { + ckpt_err(ctx, ret, "%(T)Setting chroot %s", name); + return ret; + } + return 0; +} + +static int restore_cwd(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name) +{ + struct nameidata nd; + int ret; + + ckpt_debug("attempting chdir to %s\n", name); + ret = path_lookup(name, LOOKUP_FOLLOW | 
LOOKUP_DIRECTORY, &nd); + if (ret) { + ckpt_err(ctx, ret, "%(T)Opening cwd %s", name); + return ret; + } + ret = do_chdir(fs, &nd.path); + path_put(&nd.path); + if (ret) { + ckpt_err(ctx, ret, "%(T)Setting cwd %s", name); + return ret; + } + return 0; +} + +/* + * Called by objhash when it runs into a CKPT_OBJ_FS entry. Creates + * an fs_struct with desired chroot/cwd and places it in the hash. + */ +static struct fs_struct *do_restore_fs(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_fs *h; + struct fs_struct *fs; + char *path; + int ret = 0; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FS); + if (IS_ERR(h)) + return ERR_PTR(PTR_ERR(h)); + ckpt_hdr_put(ctx, h); + + fs = copy_fs_struct(current->fs); + if (!fs) + return ERR_PTR(-ENOMEM); + + ret = ckpt_read_fname(ctx, &path); + if (ret < 0) + goto out; + ret = restore_cwd(ctx, fs, path); + kfree(path); + if (ret) + goto out; + + ret = ckpt_read_fname(ctx, &path); + if (ret < 0) + goto out; + ret = restore_chroot(ctx, fs, path); + kfree(path); + +out: + if (ret) { + free_fs_struct(fs); + return ERR_PTR(ret); + } + return fs; +} + +void *restore_fs(struct ckpt_ctx *ctx) +{ + return (void *) do_restore_fs(ctx); +} diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index 84bceec..5c4749d 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -15,6 +15,7 @@ #include <linux/hash.h> #include <linux/file.h> #include <linux/fdtable.h> +#include <linux/fs_struct.h> #include <linux/sched.h> #include <linux/ipc_namespace.h> #include <linux/user_namespace.h> @@ -126,6 +127,29 @@ static int obj_mm_users(void *ptr) return atomic_read(&((struct mm_struct *) ptr)->mm_users); } +static int obj_fs_grab(void *ptr) +{ + get_fs_struct((struct fs_struct *) ptr); + return 0; +} + +static void obj_fs_drop(void *ptr, int lastref) +{ + put_fs_struct((struct fs_struct *) ptr); +} + +static int obj_fs_users(void *ptr) +{ + /* + * It's safe to not use fs->lock because the fs is referenced. 
+	 * It's also sufficient for leak detection: with no leak the
+	 * count can't change; with a leak it will be too big already
+	 * (even if it's about to grow), and if it's about to shrink
+	 * then it's as if we sampled the count a bit earlier.
+	 */
+	return ((struct fs_struct *) ptr)->users;
+}
+
 static int obj_sighand_grab(void *ptr)
 {
 	atomic_inc(&((struct sighand_struct *) ptr)->count);
@@ -330,6 +354,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* fs object */
+	{
+		.obj_name = "FS",
+		.obj_type = CKPT_OBJ_FS,
+		.ref_drop = obj_fs_drop,
+		.ref_grab = obj_fs_grab,
+		.ref_users = obj_fs_users,
+		.checkpoint = checkpoint_fs,
+		.restore = restore_fs,
+	},
 	/* sighand object */
 	{
 		.obj_name = "SIGHAND",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index e0ef795..f917112 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -232,6 +232,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
 	int mm_objref;
+	int fs_objref;
 	int sighand_objref;
 	int signal_objref;
 	int first, ret;
@@ -272,6 +273,13 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return mm_objref;
 	}
 
+	/* note: this must come *after* file-table and mm */
+	fs_objref = checkpoint_obj_fs(ctx, t);
+	if (fs_objref < 0) {
+		ckpt_err(ctx, fs_objref, "%(T)process fs\n");
+		return fs_objref;
+	}
+
 	sighand_objref = checkpoint_obj_sighand(ctx, t);
 	ckpt_debug("sighand: objref %d\n", sighand_objref);
 	if (sighand_objref < 0) {
@@ -299,6 +307,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
+	h->fs_objref = fs_objref;
 	h->sighand_objref = sighand_objref;
 	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -477,6 +486,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_collect_mm(ctx, t);
 	if (ret < 0)
 		return ret;
+	ret = ckpt_collect_fs(ctx, t);
+	if (ret < 0)
+		return ret;
 	ret = ckpt_collect_sighand(ctx, t);
 
 	return ret;
@@ -645,6 +657,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	ret = restore_obj_fs(ctx, h->fs_objref);
+	ckpt_debug("fs: ret %d (%p)\n", ret, current->fs);
+	if (ret < 0)
+		return ret;
+
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
 	if (ret < 0)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index eee0590..2a4c6f5 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -6,6 +6,27 @@
 #include <linux/fs_struct.h>
 
 /*
+ * call with owning task locked
+ */
+void get_fs_struct(struct fs_struct *fs)
+{
+	write_lock(&fs->lock);
+	fs->users++;
+	write_unlock(&fs->lock);
+}
+
+void put_fs_struct(struct fs_struct *fs)
+{
+	int kill;
+
+	write_lock(&fs->lock);
+	kill = !--fs->users;
+	write_unlock(&fs->lock);
+	if (kill)
+		free_fs_struct(fs);
+}
+
+/*
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
  * It can block.
  */
diff --git a/fs/open.c b/fs/open.c
index 040cef7..62fc70c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -527,6 +527,18 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode)
 	return sys_faccessat(AT_FDCWD, filename, mode);
 }
 
+int do_chdir(struct fs_struct *fs, struct path *path)
+{
+	int error;
+
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	if (error)
+		return error;
+
+	set_fs_pwd(fs, path);
+	return 0;
+}
+
 SYSCALL_DEFINE1(chdir, const char __user *, filename)
 {
 	struct path path;
@@ -534,17 +546,10 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
 
 	error = user_path_dir(filename, &path);
 	if (error)
-		goto out;
-
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
-	if (error)
-		goto dput_and_out;
-
-	set_fs_pwd(current->fs, &path);
+		return error;
 
-dput_and_out:
+	error = do_chdir(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
@@ -574,31 +579,36 @@ out:
 	return error;
 }
 
-SYSCALL_DEFINE1(chroot, const char __user *, filename)
+int do_chroot(struct fs_struct *fs, struct path *path)
 {
-	struct path path;
 	int error;
 
-	error = user_path_dir(filename, &path);
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
 	if (error)
-		goto out;
+		return error;
+
+	if (!capable(CAP_SYS_CHROOT))
+		return -EPERM;
 
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	error = security_path_chroot(path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	error = -EPERM;
-	if (!capable(CAP_SYS_CHROOT))
-		goto dput_and_out;
-	error = security_path_chroot(&path);
+	set_fs_root(fs, path);
+	return 0;
+}
+
+SYSCALL_DEFINE1(chroot, const char __user *, filename)
+{
+	struct path path;
+	int error;
+
+	error = user_path_dir(filename, &path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	set_fs_root(current->fs, &path);
-	error = 0;
-dput_and_out:
+	error = do_chroot(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ca91405..3e0937a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,7 +10,7 @@
  * distribution for more details.
  */
 
-#define CHECKPOINT_VERSION 3
+#define CHECKPOINT_VERSION 4
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE 0x1
@@ -236,6 +236,12 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+extern int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref);
+extern int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_fs(struct ckpt_ctx *ctx);
+
 /* credentials */
 extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
 extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0b36430..4dc852d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -131,6 +131,9 @@ enum {
 	CKPT_HDR_MM_CONTEXT,
 #define CKPT_HDR_MM_CONTEXT CKPT_HDR_MM_CONTEXT
 
+	CKPT_HDR_FS = 451, /* must be after file-table, mm */
+#define CKPT_HDR_FS CKPT_HDR_FS
+
 	CKPT_HDR_IPC = 501,
 #define CKPT_HDR_IPC CKPT_HDR_IPC
 	CKPT_HDR_IPC_SHM,
@@ -201,6 +204,8 @@ enum obj_type {
 #define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MM,
 #define CKPT_OBJ_MM CKPT_OBJ_MM
+	CKPT_OBJ_FS,
+#define CKPT_OBJ_FS CKPT_OBJ_FS
 	CKPT_OBJ_SIGHAND,
 #define CKPT_OBJ_SIGHAND CKPT_OBJ_SIGHAND
 	CKPT_OBJ_SIGNAL,
@@ -416,6 +421,7 @@ struct ckpt_hdr_task_objs {
 
 	__s32 files_objref;
 	__s32 mm_objref;
+	__s32 fs_objref;
 	__s32 sighand_objref;
 	__s32 signal_objref;
 } __attribute__((aligned(8)));
@@ -453,6 +459,12 @@ enum restart_block_type {
 };
 
 /* file system */
+struct ckpt_hdr_fs {
+	struct ckpt_hdr h;
+	/* char *fs_root */
+	/* char *fs_pwd */
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_file_table {
 	struct ckpt_hdr h;
 	__s32 fdt_nfds;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7902a51..a1525aa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1818,6 +1818,10 @@ extern void drop_collected_mounts(struct vfsmount *);
 
 extern int vfs_statfs(struct dentry *, struct kstatfs *);
 
+struct fs_struct;
+extern int do_chdir(struct fs_struct *fs, struct path *path);
+extern int do_chroot(struct fs_struct *fs, struct path *path);
+
 extern int current_umask(void);
 
 /* /sys/fs */
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 78a05bf..a73cbcb 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -20,5 +20,7 @@ extern struct fs_struct *copy_fs_struct(struct fs_struct *);
 extern void free_fs_struct(struct fs_struct *);
 extern void daemonize_fs_struct(void);
 extern int unshare_fs_struct(void);
+extern void get_fs_struct(struct fs_struct *);
+extern void put_fs_struct(struct fs_struct *);
 
 #endif /* _LINUX_FS_STRUCT_H */
--
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
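The get_fs_struct()/put_fs_struct() pair added to fs/fs_struct.c in the patch above keeps a plain integer count under fs->lock and frees the object only after the lock is dropped. A minimal userspace sketch of that pattern follows. It is an illustration only, not the kernel code: struct fs_like, fs_like_alloc(), fs_like_get(), fs_like_put() and the fs_like_freed counter are hypothetical names, and a pthread mutex stands in for the kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct fs_struct: a lock-protected user count. */
struct fs_like {
	pthread_mutex_t lock;
	int users;
};

static int fs_like_freed;	/* counts frees, for illustration only */

static struct fs_like *fs_like_alloc(void)
{
	struct fs_like *fs = calloc(1, sizeof(*fs));

	if (!fs)
		return NULL;
	pthread_mutex_init(&fs->lock, NULL);
	fs->users = 1;		/* the creator holds the first reference */
	return fs;
}

static void fs_like_get(struct fs_like *fs)
{
	pthread_mutex_lock(&fs->lock);
	fs->users++;
	pthread_mutex_unlock(&fs->lock);
}

/* Returns 1 if this call dropped the last reference and freed the object. */
static int fs_like_put(struct fs_like *fs)
{
	int kill;

	pthread_mutex_lock(&fs->lock);
	assert(fs->users > 0);
	kill = !--fs->users;	/* decide under the lock ... */
	pthread_mutex_unlock(&fs->lock);
	if (kill) {		/* ... but free only after releasing it */
		pthread_mutex_destroy(&fs->lock);
		free(fs);
		fs_like_freed++;
	}
	return kill;
}
```

Deciding "am I the last user" while holding the lock and freeing afterwards avoids destroying a lock that is still held.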
* [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace
2010-03-19 0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
` (16 preceding siblings ...)
2010-03-19 1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
@ 2010-03-19 1:00 ` Oren Laadan
17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19 1:00 UTC (permalink / raw)
To: linux-fsdevel
Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan, Serge E. Hallyn

We only allow c/r when all processes share a single mounts ns.

We do intend to implement c/r of mounts and mounts namespaces in the
kernel. It shouldn't be ugly or complicate locking to do so. Just
haven't gotten around to it. A more complete solution is more than we
want to take on now for v19.

But we'd like as much as possible for everything which we don't
support, to not be checkpointable, since not doing so has in the past
invited slanderous accusations of being a toy implementation :)

Meanwhile, we get the following:
1) Checkpoint bails if not all tasks share the same mnt-ns
2) Leak detection works for full container checkpoint

On restart, all tasks inherit the same mnt-ns of the coordinator, by
default. A follow-up patch to user-cr will add a new switch to the
'restart' to request a CLONE_NEWNS flag when creating the root-task
of the restart.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/objhash.c           |   25 +++++++++++++++++++++++++
 include/linux/checkpoint.h     |    2 +-
 include/linux/checkpoint_hdr.h |    4 ++++
 kernel/nsproxy.c               |   16 +++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 5c4749d..42998b2 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/mnt_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <net/sock.h>
@@ -214,6 +215,22 @@ static int obj_ipc_ns_users(void *ptr)
 	return atomic_read(&((struct ipc_namespace *) ptr)->count);
 }
 
+static int obj_mnt_ns_grab(void *ptr)
+{
+	get_mnt_ns((struct mnt_namespace *) ptr);
+	return 0;
+}
+
+static void obj_mnt_ns_drop(void *ptr, int lastref)
+{
+	put_mnt_ns((struct mnt_namespace *) ptr);
+}
+
+static int obj_mnt_ns_users(void *ptr)
+{
+	return atomic_read(&((struct mnt_namespace *) ptr)->count);
+}
+
 static int obj_cred_grab(void *ptr)
 {
 	get_cred((struct cred *) ptr);
@@ -411,6 +428,14 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ipc_ns,
 		.restore = restore_ipc_ns,
 	},
+	/* mnt_ns object */
+	{
+		.obj_name = "MOUNTS NS",
+		.obj_type = CKPT_OBJ_MNT_NS,
+		.ref_grab = obj_mnt_ns_grab,
+		.ref_drop = obj_mnt_ns_drop,
+		.ref_users = obj_mnt_ns_users,
+	},
 	/* user_ns object */
 	{
 		.obj_name = "USER_NS",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3e0937a..64b4b8a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,7 +10,7 @@
  * distribution for more details.
  */
 
-#define CHECKPOINT_VERSION 4
+#define CHECKPOINT_VERSION 5
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE 0x1
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4dc852d..28dfc36 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,8 @@ enum {
 #define CKPT_HDR_UTS_NS CKPT_HDR_UTS_NS
 	CKPT_HDR_IPC_NS,
 #define CKPT_HDR_IPC_NS CKPT_HDR_IPC_NS
+	CKPT_HDR_MNT_NS,
+#define CKPT_HDR_MNT_NS CKPT_HDR_MNT_NS
 	CKPT_HDR_CAPABILITIES,
 #define CKPT_HDR_CAPABILITIES CKPT_HDR_CAPABILITIES
 	CKPT_HDR_USER_NS,
@@ -216,6 +218,8 @@ enum obj_type {
 #define CKPT_OBJ_UTS_NS CKPT_OBJ_UTS_NS
 	CKPT_OBJ_IPC_NS,
 #define CKPT_OBJ_IPC_NS CKPT_OBJ_IPC_NS
+	CKPT_OBJ_MNT_NS,
+#define CKPT_OBJ_MNT_NS CKPT_OBJ_MNT_NS
 	CKPT_OBJ_USER_NS,
 #define CKPT_OBJ_USER_NS CKPT_OBJ_USER_NS
 	CKPT_OBJ_CRED,
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 17b048e..0da0d83 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -255,10 +255,17 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * ipc_ns (shm) may keep references to files: if this is the
 	 * first time we see this ipc_ns (ret > 0), proceed inside.
 	 */
-	if (ret)
+	if (ret) {
 		ret = ckpt_collect_ipc_ns(ctx, nsproxy->ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
 
-	/* TODO: collect other namespaces here */
+	ret = ckpt_obj_collect(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
+	if (ret < 0)
+		goto out;
+
+	ret = 0;
 out:
 	put_nsproxy(nsproxy);
 	return ret;
@@ -282,7 +289,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 		goto out;
 	h->ipc_objref = ret;
 
-	/* TODO: Write other namespaces here */
+	/* FIXME: for now, only marked visited to pacify leaks */
+	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
+	if (ret < 0)
+		goto out;
 
 	ret = ckpt_write_obj(ctx, &h->h);
 out:
--
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
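The leak detection that this patch extends to the mounts namespace rests on one comparison: the number of references found while walking the checkpointed tasks against the reference count the object itself reports (the ref_users hook above). The sketch below illustrates that comparison only and is not how checkpoint/objhash.c is written. struct tracked_obj, struct obj_table, obj_collect() and obj_table_has_leak() are hypothetical names, the fixed-size array replaces the real hash, and the sketch ignores any reference the table itself might hold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: one per shared object seen during collect. */
struct tracked_obj {
	const void *ptr;	/* identity of the object */
	int collected;		/* references found by walking the tasks */
	int users;		/* reference count the object itself reports */
};

#define MAX_OBJS 16

struct obj_table {
	struct tracked_obj objs[MAX_OBJS];
	size_t nr;
};

/*
 * Record one reference to @ptr found while walking a task.
 * Returns 1 the first time the object is seen, 0 on later sightings,
 * and -1 if the table is full.
 */
static int obj_collect(struct obj_table *t, const void *ptr, int users)
{
	size_t i;

	assert(t->nr <= MAX_OBJS);
	for (i = 0; i < t->nr; i++) {
		if (t->objs[i].ptr == ptr) {
			t->objs[i].collected++;
			return 0;
		}
	}
	if (t->nr == MAX_OBJS)
		return -1;
	t->objs[t->nr].ptr = ptr;
	t->objs[t->nr].collected = 1;
	t->objs[t->nr].users = users;
	t->nr++;
	return 1;
}

/* An object "leaks" if something outside the walked tasks also holds it. */
static int obj_table_has_leak(const struct obj_table *t)
{
	size_t i;

	for (i = 0; i < t->nr; i++)
		if (t->objs[i].users > t->objs[i].collected)
			return 1;
	return 0;
}
```

In this toy model, a namespace shared by two walked tasks and reporting two users is contained, while one reporting three users is held by something outside the container.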
* [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20
@ 2010-03-17 16:07 Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 01/96] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar,
containers, Oren Laadan
Hi Andrew,
Following up on the thread on the checkpoint-restart patch set
(http://lkml.org/lkml/2010/3/1/422), the following series is the
latest checkpoint/restart, based on 2.6.33.
The first 20 patches are cleanups and preparation for c/r; they
are followed by the actual c/r code.
Please apply to -mm, and let us know if there is any way we can
help.
Thanks,
Oren.
---
Linux Checkpoint-Restart:
web, wiki: http://www.linux-cr.org
bug track: https://www.linux-cr.org/redmine
The repositories for the project are in:
kernel: http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
user tools: http://www.linux-cr.org/git/?p=user-cr.git;a=summary
tests suite: http://www.linux-cr.org/git/?p=tests-cr.git;a=summary
---
CHANGELOG:
v20 [2010-Mar-16]
BUG FIXES (only)
- [Serge Hallyn] Fix unlabeled restore case
- [Serge Hallyn] Always restore msg_msg label
- [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
- [Serge Hallyn] save_access_regs for self-checkpoint
- [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
- Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
- Cleanup: no need to restore perm->{id,key,seq}
- Fix sysvipc=n compile
- Make uts_ns=n compile
- Only use arch_setup_additional_pages() if supported by arch
- Export key symbols to enable c/r from kernel modules
- Avoid crash if incoming object doesn't have .restore
- Replace error_sem with an event completion
- [Serge Hallyn] Change sysctl and default for unprivileged use
- [Nathan Lynch] Use syscall_get_error
- Add entry for checkpoint/restart in MAINTAINERS
[2010-Feb-19] v19
NEW FEATURES
- Support for x86-64 architecture
- Support for c/r of LSM (smack, selinux)
- Support for c/r of task fs_root and pwd
- Support for c/r of epoll
- Support for c/r of eventfd
- Enable C/R while executing over NFS
- Preliminary c/r of mounts namespace
- Add @logfd argument to sys_{checkpoint,restart} prototypes
- Define new api for error and debug logging
- Restart to handle checkpoint images lacking {uts,ipc}-ns
- Refuse to checkpoint if monitoring directories with dnotify
- Refuse to checkpoint if file locks and leases are held
- Refuse to checkpoint files with f_owner
OTHER CHANGES
- Rebase to kernel 2.6.33-rc8
- Settled version of new sys_eclone()
- [Serge Hallyn] Fix potential use-before-set return (vdso)
- Update documentation and examples for new syscalls API (doc)
- [Liu Alexander] Fix typos (doc)
- [Serge Hallyn] Update checkpoint image format (doc)
- [Serge Hallyn] Use ckpt_err() for bad header values
- sys_{checkpoint,restart} to use ptregs prototype
- Set ctx->errno in do_ckpt_msg() if needed
- Fix up headers so we can munge them for use by userspace
- Multiple fixes to _ckpt_write_err() and friends
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Add global section container to image format
- [Matt Helsley] Fix total byte read/write count for large images
- ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
- [Serge Hallyn] Use ckpt_err() for arch incompatibilities
- Introduce walk_task_subtree() to iterate through descendants
- Call restore_notify_error for restart (not checkpoint !)
- Make kread/kwrite() abort if CKPT_CTX_ERROR is set
- [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc
- Simplify logic of tracking restarting tasks (->ctx)
- Coordinator kills descendants on failure for proper cleanup
- Prepare descendants needs PTRACE_MODE_ATTACH permissions
- Threads wait for entire thread group before restoring
- Add debug process-tree status during restart
- Fix handling of bogus pid arg to sys_restart
- In reparent_thread() test for PF_RESTARTING on parent
- Keep __u32s in even groups for 32-64 bit compatibility
- Define ckpt_obj_try_fetch
- Disallow zero or negative objref during restart
- Check for valid destructor before calling it (deferqueue)
- Fix false negative of test for unlinked files at checkpoint
- [Serge Hallyn] Rename fs_mnt to root_fs_path
- Restore thread/cpu state early
- Ensure null-termination of file names read from image
- Fix compile warning in restore_open_fname()
- Introduce FOLL_DIRTY to follow_page() for "dirty" pages
- [Serge Hallyn] Checkpoint saved_auxv as u64s
- Export filemap_checkpoint()
- [Serge Hallyn] Disallow checkpoint of tasks with aio requests
- Fix compilation failure when !CONFIG_CHECKPOINT (regression)
- Expose page write functions
- Do not hold mmap_sem while checkpointing vma's
- Do not hold mmap_sem when reading memory pages on restart
- Move consider_private_page() to mm/memory.c:__get_dirty_page()
- [Serge Hallyn] move destroy_mm into mmap.c and remove size check
- [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
- [Serge Hallyn] Fix return value of read_pages_contents()
- [Serge Hallyn] Change m_type to long, not int (ipc)
- Don't free sma if it's an error on restore
- Use task->saved_sigmask and drop task->checkpoint_data
- [Serge Hallyn] Handle saved_sigmask at checkpoint
- Defer restore of blocked signals mask during restart
- Self-restart to tolerate missing PGIDs
- [Serge Hallyn] skb->tail can be offset
- Export and leverage sock_alloc_file()
- [Nathan Lynch] Fix net/checkpoint.c for 64-bit
- [Dan Smith] Unify skb read/write functions and handle fragmented buffers
- [Dan Smith] Update buffer restore code to match the new format
- [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
- [Dan Smith] Remove an unnecessary check on socket restart
- [Dan Smith] Pass the stored sock->protocol into sock_create() on restore
- Relax tcp.window_clamp value in INET restore
- Restore gso_type fields on sockets and buffers for proper operation
- Fix broken compilation for no-c/r architectures
- Return -EBUSY (not BUG_ON) if fd is gone on restart
- Fix the chunk size instead of auto-tune (epoll)
ARCH: x86 (32,64)
- Use PTREGSCALL4 for sys_{checkpoint,restart}
- Remove debug-reg support (need to redo with perf_events)
- [Serge Hallyn] Support for ia32 (checkpoint, restart)
- Split arch/x86/checkpoint.c to generic and 32bit specific parts
- sys_{checkpoint,restart} to use ptregs
- Allow X86_EFLAGS_RF on restart
- [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
- Move checkpoint.c from arch/x86/mm->arch/x86/kernel
ARCH: s390 [Serge Hallyn]
- Define s390x sys_restart wrapper
- Fixes to restart-blocks logic and signal path
- Fix checkpoint and restart compat wrappers
- sys_{checkpoint,restart} to use ptregs
- Use simpler test_task_thread to test current ti flags
- Fix 31-bit s390 checkpoint/restart wrappers
- Update sys_checkpoint (do_sys_checkpoint on all archs)
- [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel
ARCH: powerpc [Nathan Lynch]
- [Serge Hallyn] Add hook task_has_saved_sigmask()
- Warn if full register state unavailable
- Fix up checkpoint syscall, tidy restart
- [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel}
[2009-Sep-22] v18
NEW FEATURES
- [Nathan Lynch] Re-introduce powerpc support
- Save/restore pseudo-terminals
- Save/restore (pty) controlling terminals
- Save/restore PGIDs
- [Dan Smith] Save/restore unix domain sockets
- Save/restore FIFOs
- Save/restore pending signals
- Save/restore rlimits
- Save/restore itimers
- [Matt Helsley] Handle many non-pseudo file-systems
OTHER CHANGES
- Rename headerless struct ckpt_hdr_* to struct ckpt_*
- [Nathan Lynch] discard const from struct cred * where appropriate
- [Serge Hallyn][s390] Set return value for self-checkpoint
- Handle kmalloc failure in restore_sem_array()
- [IPC] Collect files used by shm objects
- [IPC] Use file (not inode) as shared object on checkpoint of shm
- More ckpt_write_err()s to give information on checkpoint failure
- Adjust format of pipe buffer to include the mandatory pre-header
- [LEAKS] Mark the backing file as visited at checkpoint
- Tighten checks on supported vma to checkpoint or restart
- [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
- Introduce ckpt_collect_file() that also uses file->collect method
- Use ckpt_collect_file() instead of ckpt_obj_collect() for files
- Fix leak-detection issue in collect_mm() (test for first-time obj)
- Invoke set_close_on_exec() unconditionally on restart
- [Dan Smith] Export fill_fname() as ckpt_fill_fname()
- Interface to pass simple pointers as data with deferqueue
- [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
- Replace EAGAIN with EBUSY where necessary
- Introduce CKPT_OBJ_VISITED in leak detection
- ckpt_obj_collect() returns objref for new objects, 0 otherwise
- Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
- Introduce ckpt_obj_visit() to mark objects as visited
- Set the CHECKPOINTED flag on objects before calling checkpoint
- Introduce ckpt_obj_reserve()
- Change ref_drop() to accept a @lastref argument (for cleanup)
- Disallow multiple objects with same objref in restart
- Allow _ckpt_read_obj_type() to read header only (w/o payload)
- Fix leak of ckpt_ctx when restoring zombie tasks
- Fix race of prepare_descendant() with an ongoing fork()
- Track and report the first error if restart fails
- Tighten logic to protect against bogus pids in input
- [Matt Helsley] Improve debug output from ckpt_notify_error()
- [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
- Detect error-headers in input data on restart, and abort.
- Standard format for checkpoint error strings (and documentation)
- [Dan Smith] Add an errno validation function
- Add ckpt_read_payload(): read a variable-length object (no header)
- Add ckpt_read_string(): same for strings (ensures null-terminated)
- Add ckpt_read_consume(): consumes next object without processing
- [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
[2009-Jul-21] v17
- Introduce syscall clone_with_pids() to restore original pids
- Support threads and zombies
- Save/restore task->files
- Save/restore task->sighand
- Save/restore futex
- Save/restore credentials
- Introduce PF_RESTARTING to skip notifications on task exit
- restart(2) allow caller to ask to freeze tasks after restart
- restart(2) isn't idempotent: return -EINTR if interrupted
- Improve debugging output handling
- Make multi-process restart logic more robust and complete
- Correctly select return value for restarting tasks on success
- Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for frozen checkpointed tasks
- Fix compilation without CONFIG_CHECKPOINT
- Fix compilation with CONFIG_COMPAT
- Fix headers includes and exports
- Leak detection performed in two steps
- Detect "inverse" leaks of objects (dis)appearing unexpectedly
- Memory: save/restore mm->{flags,def_flags,saved_auxv}
- Memory: only collect sub-objects of mm once (leak detection)
- Files: validate f_mode after restore
- Namespaces: leak detection for nsproxy sub-components
- Namespaces: proper restart from namespace(s) without namespace(s)
- Save global constants in header instead of per-object
- IPC: replace sys_unshare() with create_ipc_ns()
- IPC: restore objects in suitable namespace
- IPC: correct behavior under !CONFIG_IPC_NS
- UTS: save/restore all fields
- UTS: replace sys_unshare() with create_uts_ns()
- X86_32: sanitize cpu, debug, and segment registers on restart
- cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
- cgroup_freezer: add interface to freeze a cgroup (given a task)
[2009-May-27] v16
- Privilege checks for IPC checkpoint
- Fix error string generation during checkpoint
- Use kzalloc for header allocation
- Restart blocks are arch-independent
- Redo pipe c/r using splice
- Fixes to s390 arch
- Remove powerpc arch (temporary)
- Explicitly restore ->nsproxy
- All objects in image are preceded by 'struct ckpt_hdr'
- Fix leaks detection (and leaks)
- Reorder of patchset
- Misc bugs and compilation fixes
[2009-Apr-12] v15
- Minor fixes
[2009-Apr-28] v14
- Tested against kernel v2.6.30-rc3 on x86_32.
- Refactor files checkpoint to use f_ops (file operations)
- Refactor mm/vma to use vma_ops
- Explicitly handle VDSO vma (and require compat mode)
- Added code to c/r restart-blocks (restart timeout related syscalls)
- Added code to c/r namespaces: uts, ipc (with Dan Smith)
- Added code to c/r sysvipc (shm, msg, sem)
- Support for VM_CLONE shared memory
- Added resource leak detection for whole-container checkpoint
- Added sysctl gauge to allow unprivileged restart/checkpoint
- Improve and simplify the code and logic of shared objects
- Rework image format: shared objects appear prior to their use
- Merge checkpoint and restart functionality into same files
- Massive renaming of functions: prefix "ckpt_" for generics,
"checkpoint_" for checkpoint, and "restore_" for restart.
- Report checkpoint errors as a valid (string record) in the output
- Merged PPC architecture (by Nathan Lynch)
- Requires updates to userspace tools too.
- Misc nits and bug fixes
[2009-Mar-31] v14-rc2
- Change along Dave's suggestion to use f_ops->checkpoint() for files
- Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
- Merge support for PPC arch (Nathan Lynch)
- Misc cleanups and fixes in response to comments
[2009-Mar-20] v14-rc1:
- The 'h.parent' field of 'struct cr_hdr' isn't used - discard
- Check whether calls to cr_hbuf_get() succeed or fail.
- Fixes to pipe c/r code
- Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
- Refuse non-self checkpoint if a task isn't frozen
- Use unsigned fields in checkpoint headers unless otherwise required
- Rename functions in files c/r to better reflect their role
- Add support for anonymous shared memory
- Merge support for s390 arch (Dan Smith, Serge Hallyn)
[2008-Dec-03] v13:
- Cleanups of 'struct cr_ctx' - remove unused fields
- Misc fixes for comments
[2008-Dec-17] v12:
- Fix re-alloc/reset of pgarr chain to correctly reuse buffers
(empty pgarr are saved in a separate pool chain)
- Add a couple of missed calls to cr_hbuf_put()
- cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
- Split cr_write/cr_read() to two parts: _cr_write/read() helper
- Befriend with sparse: explicit conversion to 'void __user *'
- Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
[2008-Dec-05] v11:
- Use contents of 'init->fs->root' instead of pointing to it
- Ignore symlinks (there is no such thing as an open symlink)
- cr_scan_fds() retries from scratch if it hits size limits
- Add missing test for VM_MAYSHARE when dumping memory
- Improve documentation about: behavior when tasks aren't frozen,
life span of the object hash, references to objects in the hash
[2008-Nov-26] v10:
- Grab vfs root of container init, rather than current process
- Acquire dcache_lock around call to __d_path() in cr_fill_name()
- Force end-of-string in cr_read_string() (fix possible DoS)
- Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
[2008-Nov-10] v9:
- Support multiple processes c/r
- Extend checkpoint header with architecture-dependent header
- Misc bug fixes (see individual changelogs)
- Rebase to v2.6.28-rc3.
[2008-Oct-29] v8:
- Support "external" checkpoint
- Include Dave Hansen's 'deny-checkpoint' patch
- Split docs in Documentation/checkpoint/..., and improve contents
[2008-Oct-17] v7:
- Fix save/restore state of FPU
- Fix argument given to kunmap_atomic() in memory dump/restore
[2008-Oct-07] v6:
- Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
(even though it's not really needed)
- Add assumptions and what's-missing to documentation
- Misc fixes and cleanups
[2008-Sep-11] v5:
- Config is now 'def_bool n' by default
- Improve memory dump/restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
- Remove preempt_disable() when restoring debug registers
- Rename headers files s/ckpt/checkpoint/
- Fix misc bugs in files dump/restore
- Fixes and cleanups on some error paths
- Fix misc coding style
[2008-Sep-09] v4:
- Various fixes and clean-ups
- Fix calculation of hash table size
- Fix header structure alignment
- Use standard list_... for cr_pgarr
[2008-Aug-29] v3:
- Various fixes and clean-ups
- Use standard hlist_... for hash table
- Better use of standard kmalloc/kfree
[2008-Aug-20] v2:
- Added Dump and restore of open files (regular and directories)
- Added basic handling of shared objects, and improve handling of
'parent tag' concept
- Added documentation
- Improved ABI, 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAP_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup and response to comments for v1
[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 01/96] eclone (1/11): Factor out code to allocate pidmap page
2010-03-17 16:07 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 02/96] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar,
containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v4]:
	- [Oren Laadan] Adapt to kernel 2.6.33-rc5
Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now it's called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   41 ++++++++++++++++++++++++++---------------
 1 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 2e17c9c..39292e6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,30 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (!map->page) {
+		map->page = page;
+		page = NULL;
+	}
+	spin_unlock_irq(&pidmap_lock);
+	kfree(page);
+	if (unlikely(!map->page))
+		return -1;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,22 +158,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
+		if (unlikely(!map->page))
+			if (alloc_pidmap_page(map) < 0)
 				break;
-		}
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
--
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
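alloc_pidmap_page() above allocates without holding pidmap_lock, installs the page under the lock only if nobody else has done so, and frees its own copy if it lost the race. A userspace sketch of that allocate-then-install pattern follows, under stated assumptions: struct lazy_map, lazy_map_alloc_page(), map_lock and PAGE_BYTES are hypothetical names, a pthread mutex replaces the spinlock, and the kernel's unlocked fast-path check of map->page is left out to keep the sketch free of data races.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define PAGE_BYTES 4096		/* assumed page size for the sketch */

/* Hypothetical stand-in for struct pidmap. */
struct lazy_map {
	void *page;
};

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if @map has a page on return, -1 if allocation failed. */
static int lazy_map_alloc_page(struct lazy_map *map)
{
	void *page;
	int ok;

	assert(map != NULL);

	/* Allocate without holding the lock: the allocator may be slow. */
	page = calloc(1, PAGE_BYTES);

	pthread_mutex_lock(&map_lock);
	if (!map->page) {
		map->page = page;	/* we won: install our page */
		page = NULL;
	}
	ok = map->page != NULL;
	pthread_mutex_unlock(&map_lock);

	/* We lost the race, or page is already NULL: free(NULL) is a no-op. */
	free(page);
	return ok ? 0 : -1;
}
```

A second call on the same map leaves the installed page untouched and simply discards the freshly allocated one.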
* [C/R v20][PATCH 02/96] eclone (2/11): Have alloc_pidmap() return actual error code
2010-03-17 16:07 ` [C/R v20][PATCH 01/96] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 03/96] eclone (3/11): Define set_pidmap() function Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Changelog[v1]:
	- [Oren Laadan] Rebase to kernel 2.6.33

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   10 ++++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f88bd98..e9cf524 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1167,10 +1167,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 39292e6..252babf 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -160,7 +160,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page))
 			if (alloc_pidmap_page(map) < 0)
-				break;
+				return -ENOMEM;
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -191,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return -EBUSY;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -260,8 +260,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -295,7 +297,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
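
Note: patch 02 switches alloc_pid() from returning NULL to returning an encoded error pointer. A minimal user-space sketch of that encoding follows, assuming the usual convention that the top 4095 addresses are never valid pointers. The sketch_* helpers are hypothetical stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), and sketch_alloc_id() is an invented allocator used only to exercise them.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095	/* largest errno value that gets encoded */

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the negative errno value from an encoded pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the last SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical allocator: a non-zero fail_with simulates that failure. */
static void *sketch_alloc_id(int fail_with)
{
	void *obj;

	if (fail_with)
		return sketch_err_ptr(-fail_with);

	obj = malloc(sizeof(int));
	if (!obj)
		return sketch_err_ptr(-ENOMEM);
	return obj;
}
```

With this shape the caller can tell -EBUSY from -ENOMEM through a single return value, which is what the copy_process() hunk above does with IS_ERR()/PTR_ERR().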
* [C/R v20][PATCH 03/96] eclone (3/11): Define set_pidmap() function
2010-03-17 16:07 ` [C/R v20][PATCH 02/96] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 04/96] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Define a set_pidmap() interface which is like alloc_pidmap(), except
that the caller specifies the pid number to be assigned.

Changelog[v13]:
	- Don't let do_alloc_pidmap return 0 if it failed to find a pid.
Changelog[v9]:
	- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
	- [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
	- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
	- (Eric Biederman): Avoid set_pidmap() function. Added couple of
	  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
	  actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   41 +++++++++++++++++++++++++++++++++--------
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 252babf..1f15bb6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,17 +146,18 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page))
 			if (alloc_pidmap_page(map) < 0)
@@ -165,7 +166,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -176,16 +176,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset))
 				break;
 		}
@@ -194,6 +194,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -EBUSY;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS &&
+				pid_ns->last_pid >= RESERVED_PIDS))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
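
Note: the way set_pidmap() reuses the range allocator by asking for the one-element range [target, target + 1) can be illustrated with a toy model. This is a sketch under assumptions, not the kernel code: a flat bool array replaces the paged bitmap, SKETCH_PID_MAX and SKETCH_RESERVED are made-up small constants standing in for pid_max and RESERVED_PIDS, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_PID_MAX	64	/* assumption: tiny pid space */
#define SKETCH_RESERVED	8	/* assumption: stand-in for RESERVED_PIDS */

struct sketch_ns {
	bool used[SKETCH_PID_MAX];
	int last_pid;
};

/* Claim the first free pid, scanning from last + 1 up to max, then wrapping to min. */
static int sketch_do_alloc(struct sketch_ns *ns, int last, int min, int max)
{
	int start = last + 1;
	int pid;

	if (start >= max)
		start = min;

	for (pid = start; pid < max; pid++) {
		if (!ns->used[pid]) {
			ns->used[pid] = true;
			return pid;
		}
	}
	for (pid = min; pid < start; pid++) {
		if (!ns->used[pid]) {
			ns->used[pid] = true;
			return pid;
		}
	}
	return -EBUSY;
}

/* Normal allocation: any pid, and remember where we stopped. */
static int sketch_alloc(struct sketch_ns *ns)
{
	int nr = sketch_do_alloc(ns, ns->last_pid, SKETCH_RESERVED, SKETCH_PID_MAX);

	if (nr >= 0)
		ns->last_pid = nr;
	return nr;
}

/* Targeted allocation: 0 means "any pid", otherwise exactly 'target'. */
static int sketch_set(struct sketch_ns *ns, int target)
{
	if (!target)
		return sketch_alloc(ns);

	if (target >= SKETCH_PID_MAX)
		return -EINVAL;

	if (target < 0 || (target < SKETCH_RESERVED &&
			   ns->last_pid >= SKETCH_RESERVED))
		return -EINVAL;

	/* A one-element range: either we get 'target' or -EBUSY. */
	return sketch_do_alloc(ns, target - 1, target, target + 1);
}
```

As in the patch, a targeted allocation does not update last_pid, so later untargeted allocations continue from where they left off.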
* [C/R v20][PATCH 04/96] eclone (4/11): Add target_pids parameter to alloc_pid()
2010-03-17 16:07 ` [C/R v20][PATCH 03/96] eclone (3/11): Define set_pidmap() function Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 05/96] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index e9cf524..2e10cb8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -985,6 +985,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1167,7 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 1f15bb6..b0d7fc9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -276,13 +276,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	pid_t tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -292,7 +293,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = set_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
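
Note: the per-level loop added to alloc_pid() walks from the task's own namespace level down to level 0 and, at each level, either honours the requested pid or falls back to normal allocation when the entry is 0 (or when no array was passed at all). A toy model of just that indexing, with hypothetical sk_* names and a trivial counter in place of the real pid bitmap, might look like this:

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_LEVEL 4	/* assumption: small fixed nesting depth */

/* Hypothetical per-level state: the next pid handed out when none is requested. */
struct sk_levels {
	int next[SK_MAX_LEVEL];
};

/* Honour a non-zero request, otherwise hand out the next free number. */
static int sk_pick(struct sk_levels *s, int level, int target)
{
	return target ? target : s->next[level]++;
}

/*
 * Fill out[0..level] with one pid per namespace level, youngest level first,
 * mirroring the "for (i = ns->level; i >= 0; i--)" loop.
 */
static void sk_alloc_numbers(struct sk_levels *s, int level,
			     const int *target_pids, int *out)
{
	int i;

	for (i = level; i >= 0; i--) {
		int tpid = 0;

		if (target_pids)
			tpid = target_pids[i];
		out[i] = sk_pick(s, i, tpid);
	}
}
```

The sketch leaves out error handling and the unwinding of already-allocated levels that the kernel does on failure.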
* [C/R v20][PATCH 05/96] eclone (5/11): Add target_pids parameter to copy_process()
2010-03-17 16:07 ` [C/R v20][PATCH 04/96] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 06/96] eclone (6/11): Check invalid clone flags Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Add a 'target_pids' parameter to copy_process(). The new parameter will be
used in a follow-on patch when eclone() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2e10cb8..737bca9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,12 +980,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1359,7 +1359,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1382,6 +1382,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1422,7 +1423,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 06/96] eclone (6/11): Check invalid clone flags
2010-03-17 16:07 ` [C/R v20][PATCH 05/96] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 07/96] eclone (7/11): Define do_fork_with_pids() Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future use. To ensure this,
define a mask of clone-flags and check the flags in the clone() system
calls.

Changelog[v9]:
	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
	  to avoid breaking any applications that may have set it. IOW, this
	  patch/check only applies to clone-flags bits 33 and higher.
Changelog[v8]:
	- New patch in set

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl.cs.columbia.edu>
---
 include/linux/sched.h |   12 ++++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..d57eab8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define CLONE_UNUSED		0x00001000	/* Can be reused ? */
+
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+				 CLONE_VFORK | CLONE_PARENT | CLONE_THREAD |\
+				 CLONE_NEWNS | CLONE_SYSVSEM | CLONE_SETTLS |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
+				 CLONE_DETACHED | CLONE_UNTRACED |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET | CLONE_IO)
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 737bca9..f95cbd2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
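
Note: the check added to copy_process() is the usual "reject anything outside the known mask" idiom. A small stand-alone sketch, assuming a 64-bit flags word and three made-up flag bits (the SK_* values are hypothetical and are not the kernel's CLONE_* constants):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for the sketch only. */
#define SK_FLAG_A	UINT64_C(0x1)
#define SK_FLAG_B	UINT64_C(0x2)
#define SK_FLAG_C	UINT64_C(0x4)

/* Every bit the interface currently understands. */
#define SK_VALID_FLAGS	(SK_FLAG_A | SK_FLAG_B | SK_FLAG_C)

/* Return 0 if only known bits are set, -EINVAL otherwise. */
static int sketch_check_flags(uint64_t flags)
{
	if (flags & ~SK_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits now is what keeps them available later: a new flag can be assigned to a previously rejected bit without changing behaviour for any caller that worked before.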
* [C/R v20][PATCH 07/96] eclone (7/11): Define do_fork_with_pids()
2010-03-17 16:07 ` [C/R v20][PATCH 06/96] eclone (6/11): Check invalid clone flags Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 08/96] eclone (8/11): Implement sys_eclone for x86 (32,64) Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional 'pid_set' parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.
Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of
	  'struct pid_set' is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.
Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.
Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl
Changelog[v2]:
	- To facilitate moving architecture-independent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d57eab8..4f079f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2189,6 +2189,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index f95cbd2..fb92128 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1375,12 +1375,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1483,6 +1485,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+		parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.3.3

^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 08/96] eclone (8/11): Implement sys_eclone for x86 (32,64)
2010-03-17 16:07 ` [C/R v20][PATCH 07/96] eclone (7/11): Define do_fork_with_pids() Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 09/96] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

eclone(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' parameter. This parameter lets the caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

eclone() also attempts to address a second limitation of the clone() system
call. clone() is restricted to 32 clone flags and all but one of these are
in use. If more new clone flags are needed, we will be forced to define a
new variant of the clone() system call. To address this, eclone() allows
at least 64 clone flags with some room for more if necessary.

To prevent unprivileged processes from misusing this interface, eclone()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/eclone in next patch for more details and an example of
its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for eclone() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if eclone() needs to be further extended in the future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  It's probably best to do that as a separate patchset since
	  clone_flags touches several functions and that patchset seems
	  independent of this new system call.

Changelog[v14]:
	- [Oren Laadan] Rebase to kernel 2.6.33
	  * introduce PTREGSCALL4 for sys_eclone
	  * consolidate syscall definitions for 32/64 bit
	- [Oren Laadan] Merge x86_64 (trivial patch) with current
	- [Serge Hallyn] Add eclone stub for ia32 eclone
Changelog[v13]:
	- [Dave Hansen]: Reorg to enable sharing code between x86 and x86-64.
	- [Arnd Bergmann]: With args_size parameter, ->reserved1 is redundant
	  and can be removed.
	- [Nathan Lynch]: stop warnings about assigning u64 to a (32-bit) int*.
	- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
	  ->child_stack and ensure ->child_stack_size is 0 on architectures
	  that don't need it (see comments in types.h for details).
Changelog[v12]:
	- [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base
	  is NULL.
	- [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone()
Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
Changelog[v10]:
	- Rename clone3() to clone_with_pids()
	- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
	  implementation
Changelog[v9]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field
Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().
Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.
Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of
	  'struct pid_set' is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.
Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.
Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()
Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl.cs.columbia.edu>
---
 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/syscalls.h    |    2 +
 arch/x86/include/asm/unistd_32.h   |    3 +-
 arch/x86/include/asm/unistd_64.h   |    2 +
 arch/x86/kernel/entry_32.S         |   14 ++++
 arch/x86/kernel/entry_64.S         |    1 +
 arch/x86/kernel/process.c          |   40 +++++++++++-
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/types.h              |   16 +++++
 kernel/fork.c                      |  124 +++++++++++++++++++++++++++++++++++-
 11 files changed, 204 insertions(+), 3 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 53147ad..5eec1d9 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -477,6 +477,7 @@ quiet_ni_syscall:
 	PTREGSCALL stub32_clone, sys32_clone, %rdx
 	PTREGSCALL stub32_vfork, sys_vfork, %rdi
 	PTREGSCALL stub32_iopl, sys_iopl, %rsi
+	PTREGSCALL stub32_eclone, sys_eclone, %r8
 
 ENTRY(ia32_ptregs_common)
 	popq %r11
@@ -842,4 +843,5 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad stub32_eclone
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 8868b94..972ab0e 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -27,6 +27,8 @@ long sys_execve(char __user *, char __user * __user *,
 		char __user * __user *, struct pt_regs *);
 long sys_clone(unsigned long, unsigned long, void __user *,
 	       void __user *, struct pt_regs *);
+long sys_eclone(unsigned flags_low, struct clone_args __user *uca,
+		int args_size, pid_t __user *pids, struct pt_regs *regs);
 
 /* kernel/ldt.c */
 asmlinkage int sys_modify_ldt(int, void __user *, unsigned long);
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3baf379..cd7ca6a 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,11 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_eclone		338
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 339
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 4843f7b..d87318d 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_eclone				300
+__SYSCALL(__NR_eclone, stub_eclone)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 44a8e0d..65e1735 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -758,6 +758,19 @@ ptregs_##name: \
 	addl $4,%esp; \
 	ret
 
+#define PTREGSCALL4(name) \
+	ALIGN; \
+ptregs_##name: \
+	leal 4(%esp),%eax; \
+	pushl %eax; \
+	pushl PT_ESI(%eax); \
+	movl PT_EDX(%eax),%ecx; \
+	movl PT_ECX(%eax),%edx; \
+	movl PT_EBX(%eax),%eax; \
+	call sys_##name; \
+	addl $8,%esp; \
+	ret
+
 PTREGSCALL1(iopl)
 PTREGSCALL0(fork)
 PTREGSCALL0(vfork)
@@ -767,6 +780,7 @@ PTREGSCALL0(sigreturn)
 PTREGSCALL0(rt_sigreturn)
 PTREGSCALL2(vm86)
 PTREGSCALL1(vm86old)
+PTREGSCALL4(eclone)
 
 /* Clone is an oddball. The 4th arg is in %edi */
 	ALIGN;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0697ff1..216681e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -698,6 +698,7 @@ END(\label)
 	PTREGSCALL stub_vfork, sys_vfork, %rdi
 	PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
 	PTREGSCALL stub_iopl, sys_iopl, %rsi
+	PTREGSCALL stub_eclone, sys_eclone, %r8
 
 ENTRY(ptregscall_common)
 	DEFAULT_FRAME 1 8	/* offset 8: return address */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c9b3522..b2352d9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -252,6 +252,45 @@ sys_clone(unsigned long clone_flags, unsigned long newsp,
 	return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid);
 }
 
+long
+sys_eclone(unsigned flags_low, struct clone_args __user *uca,
+	   int args_size, pid_t __user *pids, struct pt_regs *regs)
+{
+	int rc;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tidp;
+	int __user *child_tidp;
+	unsigned long __user stack;
+	unsigned long stack_size;
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 *	 higher word(s) of 'flags':
+	 *
+	 *		flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tidp = (int *)(unsigned long)kca.parent_tid_ptr;
+	child_tidp = (int *)(unsigned long)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	if (stack_size)
+		return -EINVAL;
+
+	stack = (unsigned long)kca.child_stack;
+	if (!stack)
+		stack = regs->sp;
+
+	return do_fork_with_pids(flags, stack, regs, stack_size, parent_tidp,
+			child_tidp, kca.nr_pids, pids);
+}
+
 /*
  * This gets run with %si containing the
  * function to call, and %di containing
@@ -677,4 +716,3 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
 	unsigned long range_end = mm->brk + 0x02000000;
 	return randomize_range(mm->brk, range_end, 0) ? : mm->brk;
 }
-
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 15228b5..22ae7ef 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,4 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long ptregs_eclone
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f079f7..bcc44ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2189,6 +2189,8 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern int fetch_clone_args_from_user(struct clone_args __user *, int,
+				struct clone_args *);
 extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
 				unsigned long, int __user *, int __user *,
 				unsigned int, pid_t __user *);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..d8bfd6b 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,22 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	/*
+	 * Architectures can use child_stack for either the stack pointer or
+	 * the base of of stack. If child_stack is used as the stack pointer,
+	 * child_stack_size must be 0. Otherwise child_stack_size must be
+	 * set to size of allocated stack.
+	 */
+	u64 child_stack;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+};
+
 #endif /* __KERNEL__ */
 #endif /* __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index fb92128..0f202ae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1370,6 +1370,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 *	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 *	 the order is:
+	 *
+	 *		- pids for 'unknown-namespaces' (if any)
+	 *		- pids for 'known-namespaces' (requested pids)
+	 *		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+int
+fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
+				struct clone_args *kca)
+{
+	int rc;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 *	 could be that kernel is newer and has an extended structure.
+	 *	 When that happens, this check needs to be smarter. For now,
+	 *	 assume exact match.
+ */ + if (args_size != sizeof(struct clone_args)) + return -EINVAL; + + rc = copy_from_user(kca, uca, args_size); + if (rc) + return -EFAULT; + + /* + * To avoid future compatibility issues, ensure unused fields are 0. + */ + if (kca->reserved0 || kca->clone_flags_high) + return -EINVAL; + + return 0; +} + +/* * Ok, this is the main fork-routine. * * It copies the process, and if successful kick-starts @@ -1387,7 +1495,7 @@ long do_fork_with_pids(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; + pid_t *target_pids; /* * Do some preliminary argument and permissions checking before we @@ -1421,6 +1529,16 @@ long do_fork_with_pids(unsigned long clone_flags, } } + target_pids = copy_target_pids(num_pids, upids); + if (target_pids) { + if (IS_ERR(target_pids)) + return PTR_ERR(target_pids); + + nr = -EPERM; + if (!capable(CAP_SYS_ADMIN)) + goto out_free; + } + /* * When called from kernel_thread, don't do user tracing stuff. */ @@ -1482,6 +1600,10 @@ long do_fork_with_pids(unsigned long clone_flags, } else { nr = PTR_ERR(p); } + +out_free: + kfree(target_pids); + return nr; } -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
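The placement rule described in the copy_target_pids() comment above can be modelled in user space. What follows is only a sketch under stated assumptions: layout_target_pids() is a hypothetical name, calloc()/memcpy() stand in for kzalloc()/copy_from_user(), and both "no pids requested" and "too many pids" collapse to a NULL return, whereas the patch returns NULL and ERR_PTR(-EINVAL) respectively.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical user-space model of the layout built by copy_target_pids().
 *
 * level:     nesting level of the caller's pid namespace (0 == init_pid_ns)
 * upids:     requested pids, oldest 'known namespace' first
 * unum_pids: number of entries in upids
 *
 * Returns a zeroed array of (level + 2) entries with the requested pids
 * copied to the back end, just before the spare slot kept for
 * CLONE_NEWPID.  Returns NULL if nothing was requested, if too many
 * pids were requested, or if allocation fails.
 */
pid_t *layout_target_pids(int level, const pid_t *upids, int unum_pids)
{
	int knum_pids = level + 1;	/* # of pids the kernel needs */
	pid_t *target;

	if (unum_pids <= 0 || unum_pids > knum_pids)
		return NULL;
	assert(upids != NULL);

	/* One extra, zeroed entry for a possible new pid namespace. */
	target = calloc(knum_pids + 1, sizeof(pid_t));
	if (!target)
		return NULL;

	/* Entries for 'unknown namespaces' stay 0: kernel picks those. */
	memcpy(&target[knum_pids - unum_pids], upids,
	       unum_pids * sizeof(pid_t));
	return target;
}
```

For a caller at level 3 that requests two pids, entries 0 and 1 stay zero, entries 2 and 3 carry the requested pids, and entry 4 is the spare slot.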
* [C/R v20][PATCH 09/96] eclone (9/11): Implement sys_eclone for s390 2010-03-17 16:07 ` [C/R v20][PATCH 08/96] eclone (8/11): Implement sys_eclone for x86 (32,64) Oren Laadan @ 2010-03-17 16:07 ` Oren Laadan 2010-03-17 16:07 ` [C/R v20][PATCH 10/96] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers From: Serge E. Hallyn <serue@us.ibm.com> Implement the s390 hook for sys_eclone(). Changelog: Nov 24: Removed user-space code from commit log. See user-cr git tree. Nov 17: remove redundant flags_high check Nov 13: As suggested by Heiko, convert eclone to take its parameters via registers. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> --- arch/s390/include/asm/unistd.h | 3 ++- arch/s390/kernel/compat_linux.c | 17 +++++++++++++++++ arch/s390/kernel/compat_wrapper.S | 8 ++++++++ arch/s390/kernel/process.c | 37 +++++++++++++++++++++++++++++++++++++ arch/s390/kernel/syscalls.S | 1 + 5 files changed, 65 insertions(+), 1 deletions(-) diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h index 6e9f049..2250950 100644 --- a/arch/s390/include/asm/unistd.h +++ b/arch/s390/include/asm/unistd.h @@ -269,7 +269,8 @@ #define __NR_pwritev 329 #define __NR_rt_tgsigqueueinfo 330 #define __NR_perf_event_open 331 -#define NR_syscalls 332 +#define __NR_eclone 332 +#define NR_syscalls 333 /* * There are some system calls that are not present on 64 bit, some diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c index 11c3aba..f9e8983 100644 --- a/arch/s390/kernel/compat_linux.c +++ b/arch/s390/kernel/compat_linux.c @@ -663,6 +663,23 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count) return sys_write(fd, buf, count); } +asmlinkage long sys32_clone(void) +{ + struct pt_regs *regs = task_pt_regs(current); + 
unsigned long clone_flags; + unsigned long newsp; + int __user *parent_tidptr, *child_tidptr; + + clone_flags = regs->gprs[3] & 0xffffffffUL; + newsp = regs->orig_gpr2 & 0x7fffffffUL; + parent_tidptr = compat_ptr(regs->gprs[4]); + child_tidptr = compat_ptr(regs->gprs[5]); + if (!newsp) + newsp = regs->gprs[15]; + return do_fork(clone_flags, newsp, regs, 0, + parent_tidptr, child_tidptr); +} + /* * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64. * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE} diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S index 30de2d0..cfa227e 100644 --- a/arch/s390/kernel/compat_wrapper.S +++ b/arch/s390/kernel/compat_wrapper.S @@ -1847,6 +1847,14 @@ sys_clone_wrapper: llgtr %r5,%r5 # int * jg sys_clone # branch to system call + .globl sys_eclone_wrapper +sys_eclone_wrapper: + llgfr %r2,%r2 # unsigned int + llgtr %r3,%r3 # struct clone_args * + lgfr %r4,%r4 # int + llgtr %r5,%r5 # pid_t * + jg sys_eclone # branch to system call + .globl sys32_execve_wrapper sys32_execve_wrapper: llgtr %r2,%r2 # char * diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c index 00b6d1d..5b0729a 100644 --- a/arch/s390/kernel/process.c +++ b/arch/s390/kernel/process.c @@ -240,6 +240,43 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags, parent_tidptr, child_tidptr); } +SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *, + uca, int, args_size, pid_t __user *, pids) +{ + int rc; + struct pt_regs *regs = task_pt_regs(current); + struct clone_args kca; + int __user *parent_tid_ptr; + int __user *child_tid_ptr; + unsigned long flags; + unsigned long __user child_stack; + unsigned long stack_size; + + rc = fetch_clone_args_from_user(uca, args_size, &kca); + if (rc) + return rc; + + flags = flags_low; + parent_tid_ptr = (int __user *) kca.parent_tid_ptr; + child_tid_ptr = (int __user *) kca.child_tid_ptr; + + stack_size = 
(unsigned long) kca.child_stack_size; + if (stack_size) + return -EINVAL; + + child_stack = (unsigned long) kca.child_stack; + if (!child_stack) + child_stack = regs->gprs[15]; + + /* + * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value + * to several functions. Need to convert clone_flags to 64-bit. + */ + return do_fork_with_pids(flags, child_stack, regs, stack_size, + parent_tid_ptr, child_tid_ptr, kca.nr_pids, + pids); +} + /* * This is trivial, and on the face of it looks like it * could equally well be done in user mode. diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S index 30eca07..fb8708d 100644 --- a/arch/s390/kernel/syscalls.S +++ b/arch/s390/kernel/syscalls.S @@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper) SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper) SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */ SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper) +SYSCALL(sys_eclone,sys_eclone,sys_eclone_wrapper) -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 10/96] eclone (10/11): Implement sys_eclone for powerpc 2010-03-17 16:07 ` [C/R v20][PATCH 09/96] eclone (9/11): Implement sys_eclone for s390 Oren Laadan @ 2010-03-17 16:07 ` Oren Laadan 2010-03-17 16:07 ` [C/R v20][PATCH 11/96] eclone (11/11): Document sys_eclone Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Nathan Lynch From: Nathan Lynch <ntl@pobox.com> Wired up for both ppc32 and ppc64, but tested only with the latter. Changelog: - Jan 20: (ntl) fix 32-bit build - Nov 17: (serge) remove redundant flags_high check, and don't fold it into flags. Signed-off-by: Nathan Lynch <ntl@pobox.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> --- arch/powerpc/include/asm/syscalls.h | 6 ++++ arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 3 +- arch/powerpc/kernel/entry_32.S | 8 +++++ arch/powerpc/kernel/entry_64.S | 5 +++ arch/powerpc/kernel/process.c | 54 ++++++++++++++++++++++++++++++++++- 6 files changed, 75 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index eb8eb40..1674544 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -24,6 +24,12 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1, asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp, int __user *parent_tidp, void __user *child_threadptr, int __user *child_tidp, int p6, struct pt_regs *regs); +asmlinkage int sys_eclone(unsigned long flags_low, + struct clone_args __user *args, + size_t args_size, + pid_t __user *pids, + unsigned long p5, unsigned long p6, + struct pt_regs *regs); asmlinkage int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3, unsigned long p4, unsigned long p5, unsigned long p6, struct pt_regs *regs); diff --git 
a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 07d2d19..ee41254 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -326,3 +326,4 @@ SYSCALL_SPU(perf_event_open) COMPAT_SYS_SPU(preadv) COMPAT_SYS_SPU(pwritev) COMPAT_SYS(rt_tgsigqueueinfo) +PPC_SYS(eclone) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index f6ca761..37357a2 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -345,10 +345,11 @@ #define __NR_preadv 320 #define __NR_pwritev 321 #define __NR_rt_tgsigqueueinfo 322 +#define __NR_eclone 323 #ifdef __KERNEL__ -#define __NR_syscalls 323 +#define __NR_syscalls 324 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index 1175a85..579f1da 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -586,6 +586,14 @@ ppc_clone: stw r0,_TRAP(r1) /* register set saved */ b sys_clone + .globl ppc_eclone +ppc_eclone: + SAVE_NVGPRS(r1) + lwz r0,_TRAP(r1) + rlwinm r0,r0,0,0,30 /* clear LSB to indicate full */ + stw r0,_TRAP(r1) /* register set saved */ + b sys_eclone + .globl ppc_swapcontext ppc_swapcontext: SAVE_NVGPRS(r1) diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index bdcb557..899f485 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -344,6 +344,11 @@ _GLOBAL(ppc_clone) bl .sys_clone b syscall_exit +_GLOBAL(ppc_eclone) + bl .save_nvgprs + bl .sys_eclone + b syscall_exit + _GLOBAL(ppc32_swapcontext) bl .save_nvgprs bl .compat_sys_swapcontext diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 7b816da..4bbc21f 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -885,7 +885,59 @@ int sys_clone(unsigned long clone_flags, unsigned long usp, child_tidp = TRUNC_PTR(child_tidp); } #endif - return 
do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp); + return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp); +} + +int sys_eclone(unsigned long clone_flags_low, + struct clone_args __user *uclone_args, + size_t size, + pid_t __user *upids, + unsigned long p5, unsigned long p6, + struct pt_regs *regs) +{ + struct clone_args kclone_args; + unsigned long stack_base; + int __user *parent_tidp; + int __user *child_tidp; + unsigned long stack_sz; + unsigned int nr_pids; + unsigned long flags; + unsigned long usp; + int rc; + + CHECK_FULL_REGS(regs); + + rc = fetch_clone_args_from_user(uclone_args, size, &kclone_args); + if (rc) + return rc; + + stack_sz = kclone_args.child_stack_size; + stack_base = kclone_args.child_stack; + + /* powerpc doesn't do anything useful with the stack size */ + if (stack_sz) + return -EINVAL; + + /* Interpret stack_base as the child sp if it is set. */ + usp = regs->gpr[1]; + if (stack_base) + usp = stack_base; + + flags = clone_flags_low; + + nr_pids = kclone_args.nr_pids; + + parent_tidp = (int __user *)(unsigned long)kclone_args.parent_tid_ptr; + child_tidp = (int __user *)(unsigned long)kclone_args.child_tid_ptr; + +#ifdef CONFIG_PPC64 + if (test_thread_flag(TIF_32BIT)) { + parent_tidp = TRUNC_PTR(parent_tidp); + child_tidp = TRUNC_PTR(child_tidp); + } +#endif + return do_fork_with_pids(flags, stack_base, regs, stack_sz, + parent_tidp, child_tidp, nr_pids, upids); } int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3, -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 11/96] eclone (11/11): Document sys_eclone
2010-03-17 16:07 ` [C/R v20][PATCH 10/96] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
@ 2010-03-17 16:07 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 12/96] c/r: extend arch_setup_additional_pages() Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:07 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Sukadev Bhattiprolu
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
This gives a brief overview of the eclone() system call. We should
eventually describe more details in the existing clone(2) man page or in
a new man page.
Changelog[v13]:
	- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
	  ->child_stack and ensure ->child_stack_size is 0 on architectures
	  that don't need it.
	- [Arnd Bergmann] Remove ->reserved1 field
	- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
	  example into one and use memory constraint to avoid unnecessary
	  copies.
Changelog[v12]:
	- [Serge Hallyn] Fix/simplify stack-setup in the example code
	- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()
Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.
Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the
	  updated/renamed 'struct clone_args'.
Changelog[v8]:
	- clone2() is already in use in IA64.
	  Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 348 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/eclone
diff --git a/Documentation/eclone b/Documentation/eclone
new file mode 100644
index 0000000..c2f1b4b
--- /dev/null
+++ b/Documentation/eclone
@@ -0,0 +1,348 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+		pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does, the
+	eclone() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint. Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When eclone() supports more than 32 flags, the additional bits
+		in the clone_flags should be specified in this field. This
+		field is currently unused and must be set to 0.
+
+	u64 child_stack;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' fields in
+		clone() and clone2() (on IA64) system calls. The usage of
+		these two fields depends on the processor architecture.
+
+		Most architectures use ->child_stack to pass in the stack
+		pointer itself and don't need the ->child_stack_size field.
+		On these architectures the ->child_stack_size field must be 0.
+
+		Some architectures, e.g. IA64, use ->child_stack to pass in the
+		base of the region allocated for stack. These architectures
+		must pass in the size of the stack-region in ->child_stack_size.
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' fields in the clone() system call.
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to eclone() (see below). nr_pids should not exceed
+		the current nesting level of the calling process (i.e., if the
+		process is in init_pid_ns, nr_pids cannot exceed 1, if it is
+		in a pid namespace that is a child of init_pid_ns, nr_pids
+		cannot exceed 2, and so on).
+
+	u32 reserved0;
+
+		This field is intended to allow extending eclone() in
+		the future, while preserving backward compatibility.
+		It is currently unused and must be set to 0, as must
+		clone_flags_high.
+
+	The @cargs_size parameter specifies sizeof(struct clone_args) and
+	is intended to enable extending this structure in the future, while
+	preserving backward compatibility. For now, this field must be set
+	to sizeof(struct clone_args) and this size must match the kernel's
+	view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace. If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The order of pids in @pids is oldest in pids[0] to youngest pid
+	namespace in pids[nr_pids-1]. If the number of pids specified in the
+	@pids list is fewer than the nesting level of the process, the pids
+	are applied from the youngest namespace, i.e., if the process is nested
+	in a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+	have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, eclone() returns -1 and sets 'errno' to one of the following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specified
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of parent process.
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+ + EINVAL The child_stack_size field is not 0 (on architectures that + pass in a stack pointer in ->child_stack field) + + EBUSY A requested pid is in use by another process in that namespace. + +--- +/* + * Example eclone() usage - Create a child process with pid CHILD_TID1 in + * the current pid namespace. The child gets the usual "random" pid in any + * ancestor pid namespaces. + */ +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <signal.h> +#include <errno.h> +#include <unistd.h> +#include <wait.h> +#include <sys/syscall.h> + +#define __NR_eclone 337 +#define CLONE_NEWPID 0x20000000 +#define CLONE_CHILD_SETTID 0x01000000 +#define CLONE_PARENT_SETTID 0x00100000 +#define CLONE_UNUSED 0x00001000 + +#define STACKSIZE 8192 + +typedef unsigned long long u64; +typedef unsigned int u32; +typedef int pid_t; +struct clone_args { + u64 clone_flags_high; + u64 child_stack; + u64 child_stack_size; + + u64 parent_tid_ptr; + u64 child_tid_ptr; + + u32 nr_pids; + + u32 reserved0; +}; + +#define exit _exit + +/* + * Following eclone() is based on code posted by Oren Laadan at: + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html + */ +#if defined(__i386__) && defined(__NR_eclone) + +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size, + int *pids) +{ + long retval; + + __asm__ __volatile__( + "movl %3, %%ebx\n\t" /* flags_low -> 1st (ebx) */ + "movl %4, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/ + "movl %5, %%edx\n\t" /* args_size -> 3rd (edx) */ + "movl %6, %%edi\n\t" /* pids -> 4th (edi)*/ + + "pushl %%ebp\n\t" /* save value of ebp */ + "int $0x80\n\t" /* Linux/i386 system call */ + "testl %0,%0\n\t" /* check return value */ + "jne 1f\n\t" /* jump if parent */ + + "popl %%esi\n\t" /* get subthread function */ + "call *%%esi\n\t" /* start subthread function */ + "movl %2,%0\n\t" + "int $0x80\n" /* exit system call: exit subthread */ + "1:\n\t" + "popl %%ebp\t" /* restore parent's ebp */ + + :"=a" (retval) + 
+ :"0" (__NR_eclone), + "i" (__NR_exit), + "m" (flags_low), + "m" (clone_args), + "m" (args_size), + "m" (pids) + ); + + if (retval < 0) { + errno = -retval; + retval = -1; + } + return retval; +} + +/* + * Allocate a stack for the clone-child and arrange to have the child + * execute @child_fn with @child_arg as the argument. + */ +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size) +{ + void *stack_base; + void **stack_top; + + stack_base = malloc(size + size); + if (!stack_base) { + perror("malloc()"); + exit(1); + } + + stack_top = (void **)((char *)stack_base + (size - 4)); + *--stack_top = child_arg; + *--stack_top = child_fn; + + return stack_top; +} +#endif + +/* gettid() is a bit more useful than getpid() when messing with clone() */ +int gettid() +{ + int rc; + + rc = syscall(__NR_gettid, 0, 0, 0); + if (rc < 0) { + printf("rc %d, errno %d\n", rc, errno); + exit(1); + } + return rc; +} + +#define CHILD_TID1 377 +#define CHILD_TID2 1177 +#define CHILD_TID3 2799 + +struct clone_args clone_args; +void *child_arg = &clone_args; +int child_tid; + +int do_child(void *arg) +{ + struct clone_args *cs = (struct clone_args *)arg; + int ctid; + + /* Verify we pushed the arguments correctly on the stack... */ + if (arg != child_arg) { + printf("Child: Incorrect child arg pointer, expected %p," + "actual %p\n", child_arg, arg); + exit(1); + } + + /* ... 
and that we got the thread-id we expected */ + ctid = *((int *)(unsigned long)cs->child_tid_ptr); + if (ctid != CHILD_TID1) { + printf("Child: Incorrect child tid, expected %d, actual %d\n", + CHILD_TID1, ctid); + exit(1); + } else { + printf("Child got the expected tid, %d\n", gettid()); + } + sleep(2); + + printf("[%d, %d]: Child exiting\n", getpid(), ctid); + exit(0); +} + +static int do_clone(int (*child_fn)(void *), void *child_arg, + unsigned int flags_low, int nr_pids, pid_t *pids_list) +{ + int rc; + void *stack; + struct clone_args *ca = &clone_args; + int args_size; + + stack = setup_stack(child_fn, child_arg, STACKSIZE); + + memset(ca, 0, sizeof(*ca)); + + ca->child_stack = (u64)(unsigned long)stack; + ca->child_stack_size = (u64)0; + ca->child_tid_ptr = (u64)(unsigned long)&child_tid; + ca->nr_pids = nr_pids; + + args_size = sizeof(struct clone_args); + rc = eclone(flags_low, ca, args_size, pids_list); + + printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(), + rc, errno); + return rc; +} + +/* + * Multiple pid_t pid_t values in pids_list[] here are just for illustration. + * The test case creates a child in the current pid namespace and uses only + * the first value, CHILD_TID1. 
+ */ +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 }; +int main() +{ + int rc, pid, status; + unsigned long flags; + int nr_pids = 1; + + flags = SIGCHLD|CLONE_CHILD_SETTID; + + pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list); + + printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid); + + rc = waitpid(pid, &status, __WALL); + if (rc < 0) { + printf("waitpid(): rc %d, error %d\n", rc, errno); + } else { + printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(), + gettid(), rc, status); + + if (WIFEXITED(status)) { + printf("\t EXITED, %d\n", WEXITSTATUS(status)); + } else if (WIFSIGNALED(status)) { + printf("\t SIGNALED, %d\n", WTERMSIG(status)); + } + } + return 0; +} -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 12/96] c/r: extend arch_setup_additional_pages()
2010-03-17 16:07 ` [C/R v20][PATCH 11/96] eclone (11/11): Document sys_eclone Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 13/96] c/r: break out new_user_ns() Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Alexey Dobriyan, Oren Laadan
From: Alexey Dobriyan <adobriyan@gmail.com>
Add a "start" argument to request that the vDSO be mapped at a specific
address, and fail the operation if it cannot be mapped there.
This is useful for restart(2) to ensure that the memory layout is restored
exactly as needed.
Changelog[v19]:
	- [serge hallyn] Fix potential use-before-set ret
Changelog[v2]:
	- [ntl] powerpc: vdso build fix (ckpt-v17)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/powerpc/include/asm/elf.h     |    1 +
 arch/powerpc/kernel/vdso.c         |   13 ++++++++++++-
 arch/s390/include/asm/elf.h        |    2 +-
 arch/s390/kernel/vdso.c            |   13 ++++++++++++-
 arch/sh/include/asm/elf.h          |    1 +
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/include/asm/elf.h         |    3 ++-
 arch/x86/vdso/vdso32-setup.c       |    9 +++++++--
 arch/x86/vdso/vma.c                |   11 ++++++++---
 fs/binfmt_elf.c                    |    2 +-
 10 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index c376eda..0b06255 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -266,6 +266,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index d84d192..74210ab 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@
-188,7 +188,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma) * This is called from binfmt_elf, we create the special vma for the * vDSO and insert it into the mm struct tree */ -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) +int arch_setup_additional_pages(struct linux_binprm *bprm, + unsigned long start, int uses_interp) { struct mm_struct *mm = current->mm; struct page **vdso_pagelist; @@ -220,6 +221,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) vdso_base = VDSO32_MBASE; #endif + /* in case restart(2) mandates a specific location */ + if (start) + vdso_base = start; + current->mm->context.vdso_base = 0; /* vDSO has a problem and was disabled, just don't "enable" it for the @@ -249,6 +254,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) /* Add required alignment. */ vdso_base = ALIGN(vdso_base, VDSO_ALIGNMENT); + /* for restart(2), double check that we got we asked for */ + if (start && vdso_base != start) { + rc = -EBUSY; + goto fail_mmapsem; + } + /* * Put vDSO base into mm struct. 
We need to do this before calling * install_special_mapping or the perf counter mmap tracking code diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h index 354d426..5081938 100644 --- a/arch/s390/include/asm/elf.h +++ b/arch/s390/include/asm/elf.h @@ -216,6 +216,6 @@ do { \ struct linux_binprm; #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1 -int arch_setup_additional_pages(struct linux_binprm *, int); +int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int); #endif diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c index 5f99e66..706c16a 100644 --- a/arch/s390/kernel/vdso.c +++ b/arch/s390/kernel/vdso.c @@ -194,7 +194,8 @@ static void vdso_init_cr5(void) * This is called from binfmt_elf, we create the special vma for the * vDSO and insert it into the mm struct tree */ -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) +int arch_setup_additional_pages(struct linux_binprm *bprm, + unsigned long start, int uses_interp) { struct mm_struct *mm = current->mm; struct page **vdso_pagelist; @@ -225,6 +226,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) vdso_pages = vdso32_pages; #endif + /* in case restart(2) mandates a specific location */ + if (start) + vdso_base = start; + /* * vDSO has a problem and was disabled, just don't "enable" it for * the process @@ -247,6 +252,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) goto out_up; } + /* for restart(2), double check that we got we asked for */ + if (start && vdso_base != start) { + rc = -EINVAL; + goto out_up; + } + /* * Put vDSO base into mm struct. 
We need to do this before calling * install_special_mapping or the perf counter mmap tracking code diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h index ac04255..036ea4b 100644 --- a/arch/sh/include/asm/elf.h +++ b/arch/sh/include/asm/elf.h @@ -201,6 +201,7 @@ do { \ #define ARCH_HAS_SETUP_ADDITIONAL_PAGES struct linux_binprm; extern int arch_setup_additional_pages(struct linux_binprm *bprm, + unsigned long start, int uses_interp); extern unsigned int vdso_enabled; diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c index 3f7e415..64c70e5 100644 --- a/arch/sh/kernel/vsyscall/vsyscall.c +++ b/arch/sh/kernel/vsyscall/vsyscall.c @@ -59,7 +59,7 @@ int __init vsyscall_init(void) } /* Setup a VMA at program startup for the vsyscall page */ -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) +int arch_setup_additional_pages(struct linux_binprm *bprm, unsigned long start, int uses_interp) { struct mm_struct *mm = current->mm; unsigned long addr; diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h index f2ad216..3761be8 100644 --- a/arch/x86/include/asm/elf.h +++ b/arch/x86/include/asm/elf.h @@ -312,9 +312,10 @@ struct linux_binprm; #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1 extern int arch_setup_additional_pages(struct linux_binprm *bprm, + unsigned long start, int uses_interp); -extern int syscall32_setup_pages(struct linux_binprm *, int exstack); +extern int syscall32_setup_pages(struct linux_binprm *, unsigned long start, int exstack); #define compat_arch_setup_additional_pages syscall32_setup_pages extern unsigned long arch_randomize_brk(struct mm_struct *mm); diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c index 02b442e..62043c1 100644 --- a/arch/x86/vdso/vdso32-setup.c +++ b/arch/x86/vdso/vdso32-setup.c @@ -310,7 +310,8 @@ int __init sysenter_setup(void) } /* Setup a VMA at program startup for the vsyscall page */ -int 
arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) +int arch_setup_additional_pages(struct linux_binprm *bprm, + unsigned long start, int uses_interp) { struct mm_struct *mm = current->mm; unsigned long addr; @@ -331,13 +332,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) if (compat) addr = VDSO_HIGH_BASE; else { - addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0); + addr = get_unmapped_area(NULL, start, PAGE_SIZE, 0, 0); if (IS_ERR_VALUE(addr)) { ret = addr; goto up_fail; } } + /* for restart(2), double check that we got we asked for */ + if (start && addr != start) + goto up_fail; + current->mm->context.vdso = (void *)addr; if (compat_uses_vma || !compat) { diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c index 21e1aeb..b10ed32 100644 --- a/arch/x86/vdso/vma.c +++ b/arch/x86/vdso/vma.c @@ -99,23 +99,28 @@ static unsigned long vdso_addr(unsigned long start, unsigned len) /* Setup a VMA at program startup for the vsyscall page. Not called for compat tasks */ -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) +int arch_setup_additional_pages(struct linux_binprm *bprm, + unsigned long start, int uses_interp) { struct mm_struct *mm = current->mm; unsigned long addr; - int ret; + int ret = -EINVAL; if (!vdso_enabled) return 0; down_write(&mm->mmap_sem); - addr = vdso_addr(mm->start_stack, vdso_size); + addr = start ? 
: vdso_addr(mm->start_stack, vdso_size); addr = get_unmapped_area(NULL, addr, vdso_size, 0, 0); if (IS_ERR_VALUE(addr)) { ret = addr; goto up_fail; } + /* for restart(2), double check that we got we asked for */ + if (start && addr != start) + goto up_fail; + current->mm->context.vdso = (void *)addr; ret = install_special_mapping(mm, addr, vdso_size, diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index fd5b2ea..50e30ff 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -922,7 +922,7 @@ static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs) set_binfmt(&elf_format); #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES - retval = arch_setup_additional_pages(bprm, !!elf_interpreter); + retval = arch_setup_additional_pages(bprm, 0, !!elf_interpreter); if (retval < 0) { send_sig(SIGKILL, current, 0); goto out; -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
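The "ask for an address, then verify it" pattern used by patch 12/96 above can be sketched in userspace. This is only an analogy under stated assumptions: map_at_or_fail() is a hypothetical helper, not part of the patch, and it uses mmap(2) with a placement hint (no MAP_FIXED) in place of get_unmapped_area(). The -EBUSY return mirrors the powerpc hunk; the other architectures in the patch return different error codes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical userspace helper (not part of the patch): map `len` bytes
 * of anonymous memory. A non-zero `start` is passed to mmap() as a hint
 * only, and the call fails with -EBUSY unless the kernel placed the
 * mapping exactly there. A zero `start` accepts any address.
 */
static int map_at_or_fail(uintptr_t start, size_t len, void **out)
{
	void *addr;

	assert(out != NULL);

	addr = mmap((void *)start, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -errno;

	/* as in the patch: double check that we got what we asked for */
	if (start && (uintptr_t)addr != start) {
		munmap(addr, len);
		return -EBUSY;
	}

	*out = addr;
	return 0;
}
```

Calling the helper with start == 0 corresponds to the exec path (load_elf_binary() passes 0), while a restart-style caller would pass the address recorded at checkpoint time and treat any failure as fatal for the restore.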
* [C/R v20][PATCH 13/96] c/r: break out new_user_ns() 2010-03-17 16:08 ` [C/R v20][PATCH 12/96] c/r: extend arch_setup_additional_pages() Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 14/96] c/r: split core function out of some set*{u,g}id functions Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers From: Serge E. Hallyn <serue@us.ibm.com> Break out the core function which checks privilege and (if allowed) creates a new user namespace, with the passed-in creating user_struct. Note that a user_namespace, unlike other namespace pointers, is not stored in the nsproxy. Rather it is purely a property of user_structs. This will let us keep the task restore code simpler. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Oren Laadan <orenl@cs.columbia.edu> --- include/linux/user_namespace.h | 8 ++++++ kernel/user_namespace.c | 53 ++++++++++++++++++++++++++++------------ 2 files changed, 45 insertions(+), 16 deletions(-) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index cc4f453..f6ea75d 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns; #ifdef CONFIG_USER_NS +struct user_namespace *new_user_ns(struct user_struct *creator, + struct user_struct **newroot); static inline struct user_namespace *get_user_ns(struct user_namespace *ns) { if (ns) @@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns) #else +static inline struct user_namespace *new_user_ns(struct user_struct *creator, + struct user_struct **newroot) +{ + return ERR_PTR(-EINVAL); +} + static inline struct user_namespace *get_user_ns(struct user_namespace *ns) { return &init_user_ns; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 076c7c8..e624b0f 100644 
--- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -11,15 +11,8 @@ #include <linux/user_namespace.h> #include <linux/cred.h> -/* - * Create a new user namespace, deriving the creator from the user in the - * passed credentials, and replacing that user with the new root user for the - * new namespace. - * - * This is called by copy_creds(), which will finish setting the target task's - * credentials. - */ -int create_user_ns(struct cred *new) +static struct user_namespace *_new_user_ns(struct user_struct *creator, + struct user_struct **newroot) { struct user_namespace *ns; struct user_struct *root_user; @@ -27,7 +20,7 @@ int create_user_ns(struct cred *new) ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL); if (!ns) - return -ENOMEM; + return ERR_PTR(-ENOMEM); kref_init(&ns->kref); @@ -38,12 +31,43 @@ int create_user_ns(struct cred *new) root_user = alloc_uid(ns, 0); if (!root_user) { kfree(ns); - return -ENOMEM; + return ERR_PTR(-ENOMEM); } /* set the new root user in the credentials under preparation */ - ns->creator = new->user; - new->user = root_user; + ns->creator = creator; + + /* alloc_uid() incremented the userns refcount. Just set it to 1 */ + kref_set(&ns->kref, 1); + + *newroot = root_user; + return ns; +} + +struct user_namespace *new_user_ns(struct user_struct *creator, + struct user_struct **newroot) +{ + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + return _new_user_ns(creator, newroot); +} + +/* + * Create a new user namespace, deriving the creator from the user in the + * passed credentials, and replacing that user with the new root user for the + * new namespace. + * + * This is called by copy_creds(), which will finish setting the target task's + * credentials. 
+ */ +int create_user_ns(struct cred *new) +{ + struct user_namespace *ns; + + ns = new_user_ns(new->user, &new->user); + if (IS_ERR(ns)) + return PTR_ERR(ns); + new->uid = new->euid = new->suid = new->fsuid = 0; new->gid = new->egid = new->sgid = new->fsgid = 0; put_group_info(new->group_info); @@ -54,9 +78,6 @@ int create_user_ns(struct cred *new) #endif /* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */ - /* alloc_uid() incremented the userns refcount. Just set it to 1 */ - kref_set(&ns->kref, 1); - return 0; } -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
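Patch 13/96 above changes the inner helper from returning an int to returning either a valid pointer or an errno encoded in the pointer, which create_user_ns() then converts back to an int. A minimal userspace model of that convention follows. All model_* names are hypothetical, the macros are re-implemented here rather than taken from the kernel headers, and the `privileged` flag is an assumed stand-in for the capable(CAP_SYS_ADMIN) check.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the top 4095 addresses are reserved for encoded errnos. */
#define MODEL_MAX_ERRNO 4095

static inline void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Hypothetical stand-in for struct user_namespace. */
struct model_ns {
	unsigned int creator;
};

/* Models new_user_ns(): privilege check first, then allocation. */
static struct model_ns *model_new_ns(unsigned int creator, int privileged)
{
	struct model_ns *ns;

	if (!privileged)
		return model_err_ptr(-EPERM);

	ns = malloc(sizeof(*ns));
	if (!ns)
		return model_err_ptr(-ENOMEM);

	ns->creator = creator;
	return ns;
}

/* Models create_user_ns(): unwrap the pointer back into an int result. */
static int model_create_ns(unsigned int creator, int privileged,
			   struct model_ns **out)
{
	struct model_ns *ns;

	assert(out != NULL);

	ns = model_new_ns(creator, privileged);
	if (model_is_err(ns))
		return (int)model_ptr_err(ns);

	*out = ns;
	return 0;
}
```

The point of the split is that a second caller, such as restart code, can use the pointer-returning function directly without going through a struct cred.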
* [C/R v20][PATCH 14/96] c/r: split core function out of some set*{u,g}id functions 2010-03-17 16:08 ` [C/R v20][PATCH 13/96] c/r: break out new_user_ns() Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 15/96] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers From: Serge E. Hallyn <serue@us.ibm.com> When restarting tasks, we want to be able to change xuid and xgid in a struct cred, and do so with security checks. Break the core functionality of set{fs,res}{u,g}id into cred_setX which performs the access checks based on current_cred(), but performs the requested change on a passed-in cred. This will allow us to securely construct struct creds based on a checkpoint image, constrained by the caller's permissions, and apply them to the caller at the end of sys_restart(). Signed-off-by: Serge E. 
Hallyn <serue@us.ibm.com> Acked-by: Oren Laadan <orenl@cs.columbia.edu> --- include/linux/cred.h | 8 +++ kernel/cred.c | 114 ++++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 134 ++++++++------------------------------------------ 3 files changed, 143 insertions(+), 113 deletions(-) diff --git a/include/linux/cred.h b/include/linux/cred.h index 4e3387a..e35631e 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -22,6 +22,9 @@ struct user_struct; struct cred; struct inode; +/* defined in sys.c, used in cred_setresuid */ +extern int set_user(struct cred *new); + /* * COW Supplementary groups list */ @@ -396,4 +399,9 @@ do { \ *(_fsgid) = __cred->fsgid; \ } while(0) +int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid); +int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid); +int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid); +int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid); + #endif /* _LINUX_CRED_H */ diff --git a/kernel/cred.c b/kernel/cred.c index 1ed8ca1..1fefcb1 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -890,3 +890,117 @@ void validate_creds_for_do_exit(struct task_struct *tsk) } #endif /* CONFIG_DEBUG_CREDENTIALS */ + +int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid) +{ + int retval; + const struct cred *old; + + retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES); + if (retval) + return retval; + old = current_cred(); + + if (!capable(CAP_SETUID)) { + if (ruid != (uid_t) -1 && ruid != old->uid && + ruid != old->euid && ruid != old->suid) + return -EPERM; + if (euid != (uid_t) -1 && euid != old->uid && + euid != old->euid && euid != old->suid) + return -EPERM; + if (suid != (uid_t) -1 && suid != old->uid && + suid != old->euid && suid != old->suid) + return -EPERM; + } + + if (ruid != (uid_t) -1) { + new->uid = ruid; + if (ruid != old->uid) { + retval = set_user(new); + if (retval < 0) + return retval; + } + } + if (euid != 
(uid_t) -1) + new->euid = euid; + if (suid != (uid_t) -1) + new->suid = suid; + new->fsuid = new->euid; + + return security_task_fix_setuid(new, old, LSM_SETID_RES); +} + +int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, + gid_t sgid) +{ + const struct cred *old = current_cred(); + int retval; + + retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES); + if (retval) + return retval; + + if (!capable(CAP_SETGID)) { + if (rgid != (gid_t) -1 && rgid != old->gid && + rgid != old->egid && rgid != old->sgid) + return -EPERM; + if (egid != (gid_t) -1 && egid != old->gid && + egid != old->egid && egid != old->sgid) + return -EPERM; + if (sgid != (gid_t) -1 && sgid != old->gid && + sgid != old->egid && sgid != old->sgid) + return -EPERM; + } + + if (rgid != (gid_t) -1) + new->gid = rgid; + if (egid != (gid_t) -1) + new->egid = egid; + if (sgid != (gid_t) -1) + new->sgid = sgid; + new->fsgid = new->egid; + return 0; +} + +int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid) +{ + const struct cred *old; + + old = current_cred(); + *old_fsuid = old->fsuid; + + if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0) + return -EPERM; + + if (uid == old->uid || uid == old->euid || + uid == old->suid || uid == old->fsuid || + capable(CAP_SETUID)) { + if (uid != *old_fsuid) { + new->fsuid = uid; + if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0) + return 0; + } + } + return -EPERM; +} + +int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid) +{ + const struct cred *old; + + old = current_cred(); + *old_fsgid = old->fsgid; + + if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS)) + return -EPERM; + + if (gid == old->gid || gid == old->egid || + gid == old->sgid || gid == old->fsgid || + capable(CAP_SETGID)) { + if (gid != *old_fsgid) { + new->fsgid = gid; + return 0; + } + } + return -EPERM; +} diff --git a/kernel/sys.c b/kernel/sys.c index 18bde97..0df737a 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ 
-563,11 +563,12 @@ error: /* * change the user struct in a credentials set to match the new UID */ -static int set_user(struct cred *new) +int set_user(struct cred *new) { struct user_struct *new_user; - new_user = alloc_uid(current_user_ns(), new->uid); + /* is this ok? */ + new_user = alloc_uid(new->user->user_ns, new->uid); if (!new_user) return -EAGAIN; @@ -708,14 +709,12 @@ error: return retval; } - /* * This function implements a generic ability to update ruid, euid, * and suid. This allows you to implement the 4.4 compatible seteuid(). */ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid) { - const struct cred *old; struct cred *new; int retval; @@ -723,45 +722,10 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid) if (!new) return -ENOMEM; - retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES); - if (retval) - goto error; - old = current_cred(); - - retval = -EPERM; - if (!capable(CAP_SETUID)) { - if (ruid != (uid_t) -1 && ruid != old->uid && - ruid != old->euid && ruid != old->suid) - goto error; - if (euid != (uid_t) -1 && euid != old->uid && - euid != old->euid && euid != old->suid) - goto error; - if (suid != (uid_t) -1 && suid != old->uid && - suid != old->euid && suid != old->suid) - goto error; - } - - if (ruid != (uid_t) -1) { - new->uid = ruid; - if (ruid != old->uid) { - retval = set_user(new); - if (retval < 0) - goto error; - } - } - if (euid != (uid_t) -1) - new->euid = euid; - if (suid != (uid_t) -1) - new->suid = suid; - new->fsuid = new->euid; - - retval = security_task_fix_setuid(new, old, LSM_SETID_RES); - if (retval < 0) - goto error; - - return commit_creds(new); + retval = cred_setresuid(new, ruid, euid, suid); + if (retval == 0) + return commit_creds(new); -error: abort_creds(new); return retval; } @@ -783,43 +747,17 @@ SYSCALL_DEFINE3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __u */ SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid) { - const struct cred *old; 
struct cred *new; int retval; new = prepare_creds(); if (!new) return -ENOMEM; - old = current_cred(); - retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES); - if (retval) - goto error; + retval = cred_setresgid(new, rgid, egid, sgid); + if (retval == 0) + return commit_creds(new); - retval = -EPERM; - if (!capable(CAP_SETGID)) { - if (rgid != (gid_t) -1 && rgid != old->gid && - rgid != old->egid && rgid != old->sgid) - goto error; - if (egid != (gid_t) -1 && egid != old->gid && - egid != old->egid && egid != old->sgid) - goto error; - if (sgid != (gid_t) -1 && sgid != old->gid && - sgid != old->egid && sgid != old->sgid) - goto error; - } - - if (rgid != (gid_t) -1) - new->gid = rgid; - if (egid != (gid_t) -1) - new->egid = egid; - if (sgid != (gid_t) -1) - new->sgid = sgid; - new->fsgid = new->egid; - - return commit_creds(new); - -error: abort_creds(new); return retval; } @@ -836,7 +774,6 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u return retval; } - /* * "setfsuid()" sets the fsuid - the uid used for filesystem checks. 
This * is used for "access()" and for the NFS daemon (letting nfsd stay at @@ -845,35 +782,20 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u */ SYSCALL_DEFINE1(setfsuid, uid_t, uid) { - const struct cred *old; struct cred *new; uid_t old_fsuid; + int retval; new = prepare_creds(); if (!new) return current_fsuid(); - old = current_cred(); - old_fsuid = old->fsuid; - - if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0) - goto error; - - if (uid == old->uid || uid == old->euid || - uid == old->suid || uid == old->fsuid || - capable(CAP_SETUID)) { - if (uid != old_fsuid) { - new->fsuid = uid; - if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0) - goto change_okay; - } - } -error: - abort_creds(new); - return old_fsuid; + retval = cred_setfsuid(new, uid, &old_fsuid); + if (retval == 0) + commit_creds(new); + else + abort_creds(new); -change_okay: - commit_creds(new); return old_fsuid; } @@ -882,34 +804,20 @@ change_okay: */ SYSCALL_DEFINE1(setfsgid, gid_t, gid) { - const struct cred *old; struct cred *new; gid_t old_fsgid; + int retval; new = prepare_creds(); if (!new) return current_fsgid(); - old = current_cred(); - old_fsgid = old->fsgid; - - if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS)) - goto error; - - if (gid == old->gid || gid == old->egid || - gid == old->sgid || gid == old->fsgid || - capable(CAP_SETGID)) { - if (gid != old_fsgid) { - new->fsgid = gid; - goto change_okay; - } - } -error: - abort_creds(new); - return old_fsgid; + retval = cred_setfsgid(new, gid, &old_fsgid); + if (retval == 0) + commit_creds(new); + else + abort_creds(new); -change_okay: - commit_creds(new); return old_fsgid; } -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* [C/R v20][PATCH 15/96] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer 2010-03-17 16:08 ` [C/R v20][PATCH 14/96] c/r: split core function out of some set*{u,g}id functions Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 16/96] cgroup freezer: Update stale locking comments Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Matt Helsley, Cedric Le Goater, Paul Menage, Li Zefan, Rafael J. Wysocki, Pavel Machek, linux-pm From: Matt Helsley <matthltc@us.ibm.com> When the cgroup freezer is used to freeze tasks we do not want to thaw those tasks during resume. Currently we test the cgroup freezer state of the resuming tasks to see if the cgroup is FROZEN. If so then we don't thaw the task. However, the FREEZING state also indicates that the task should remain frozen. Treating FREEZING the same way also avoids a problem pointed out by Oren Laadan: the freezer state transition from FREEZING to FROZEN is updated lazily when userspace reads or writes the freezer.state file in the cgroup filesystem. Without this change, resume will thaw tasks in cgroups which should be in the FROZEN state if there was no read/write of the freezer.state file to trigger the transition before suspend. NOTE: Another "simple" solution would be to always update the cgroup freezer state during resume. However it's a bad choice for several reasons: Updating the cgroup freezer state is somewhat expensive because it requires walking all the tasks in the cgroup and checking if they are each frozen. Worse, this could easily make resume run in N^2 time where N is the number of tasks in the cgroup. Finally, updating the freezer state from this code path requires trickier locking because of the way locks must be ordered. 
Instead of updating the freezer state we rely on the fact that lazy updates only manage the transition from FREEZING to FROZEN. We know that a cgroup with the FREEZING state may actually be FROZEN so test for that state too. This makes sense in the resume path even for partially-frozen cgroups -- those that really are FREEZING but not FROZEN. Reported-by: Oren Laadan <orenl@cs.columbia.edu> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Cedric Le Goater <legoater@free.fr> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Cc: linux-pm@lists.linux-foundation.org Seems like a candidate for -stable. --- include/linux/freezer.h | 7 +++++-- kernel/cgroup_freezer.c | 9 ++++++--- kernel/power/process.c | 2 +- 3 files changed, 12 insertions(+), 6 deletions(-) diff --git a/include/linux/freezer.h b/include/linux/freezer.h index 5a361f8..da7e52b 100644 --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -64,9 +64,12 @@ extern bool freeze_task(struct task_struct *p, bool sig_only); extern void cancel_freezing(struct task_struct *p); #ifdef CONFIG_CGROUP_FREEZER -extern int cgroup_frozen(struct task_struct *task); +extern int cgroup_freezing_or_frozen(struct task_struct *task); #else /* !CONFIG_CGROUP_FREEZER */ -static inline int cgroup_frozen(struct task_struct *task) { return 0; } +static inline int cgroup_freezing_or_frozen(struct task_struct *task) +{ + return 0; +} #endif /* !CONFIG_CGROUP_FREEZER */ /* diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index 59e9ef6..eb3f34d 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -47,17 +47,20 @@ static inline struct freezer *task_freezer(struct task_struct *task) struct freezer, css); } -int cgroup_frozen(struct task_struct *task) +int cgroup_freezing_or_frozen(struct task_struct *task) { struct freezer *freezer; enum freezer_state state; task_lock(task); freezer = task_freezer(task); - 
state = freezer->state; + if (!freezer->css.cgroup->parent) + state = CGROUP_THAWED; /* root cgroup can't be frozen */ + else + state = freezer->state; task_unlock(task); - return state == CGROUP_FROZEN; + return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN); } /* diff --git a/kernel/power/process.c b/kernel/power/process.c index 5ade1bd..de53015 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -145,7 +145,7 @@ static void thaw_tasks(bool nosig_only) if (nosig_only && should_send_signal(p)) continue; - if (cgroup_frozen(p)) + if (cgroup_freezing_or_frozen(p)) continue; thaw_process(p); -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 16/96] cgroup freezer: Update stale locking comments 2010-03-17 16:08 ` [C/R v20][PATCH 15/96] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 17/96] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Matt Helsley, Oren Laadan, Cedric Le Goater, Paul Menage, Li Zefan From: Matt Helsley <matthltc@us.ibm.com> Update stale comments regarding locking order and add a little more detail so it's easier to follow the locking between the cgroup freezer and the power management freezer code. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Cedric Le Goater <legoater@free.fr> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> --- kernel/cgroup_freezer.c | 21 +++++++++++++-------- 1 files changed, 13 insertions(+), 8 deletions(-) diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index eb3f34d..2c44736 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -88,10 +88,10 @@ struct cgroup_subsys freezer_subsys; /* Locks taken and their ordering * ------------------------------ - * css_set_lock * cgroup_mutex (AKA cgroup_lock) - * task->alloc_lock (AKA task_lock) * freezer->lock + * css_set_lock + * task->alloc_lock (AKA task_lock) * task->sighand->siglock * * cgroup code forces css_set_lock to be taken before task->alloc_lock @@ -99,33 +99,38 @@ struct cgroup_subsys freezer_subsys; * freezer_create(), freezer_destroy(): * cgroup_mutex [ by cgroup core ] * - * can_attach(): - * cgroup_mutex + * freezer_can_attach(): + * cgroup_mutex (held by caller of can_attach) * - * cgroup_frozen(): + * cgroup_freezing_or_frozen(): * task->alloc_lock 
(to get task's cgroup) * * freezer_fork() (preserving fork() performance means can't take cgroup_mutex): - * task->alloc_lock (to get task's cgroup) * freezer->lock * sighand->siglock (if the cgroup is freezing) * * freezer_read(): * cgroup_mutex * freezer->lock + * write_lock css_set_lock (cgroup iterator start) + * task->alloc_lock * read_lock css_set_lock (cgroup iterator start) * * freezer_write() (freeze): * cgroup_mutex * freezer->lock + * write_lock css_set_lock (cgroup iterator start) + * task->alloc_lock * read_lock css_set_lock (cgroup iterator start) - * sighand->siglock + * sighand->siglock (fake signal delivery inside freeze_task()) * * freezer_write() (unfreeze): * cgroup_mutex * freezer->lock + * write_lock css_set_lock (cgroup iterator start) + * task->alloc_lock * read_lock css_set_lock (cgroup iterator start) - * task->alloc_lock (to prevent races with freeze_task()) + * task->alloc_lock (inside thaw_process(), prevents race with refrigerator()) * sighand->siglock */ static struct cgroup_subsys_state *freezer_create(struct cgroup_subsys *ss, -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 17/96] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint 2010-03-17 16:08 ` [C/R v20][PATCH 16/96] cgroup freezer: Update stale locking comments Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 18/96] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Matt Helsley, Oren Laadan, Paul Menage, Li Zefan, Cedric Le Goater From: Matt Helsley <matthltc@us.ibm.com> The CHECKPOINTING state prevents userspace from unfreezing tasks until sys_checkpoint() is finished. When doing a container checkpoint, userspace will do: echo FROZEN > /cgroups/my_container/freezer.state ... rc = sys_checkpoint( <pid of container root> ); To ensure a consistent checkpoint image, userspace should not be allowed to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state) during checkpoint. "CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until the checkpoint system call is finished and ready to return. Then the freezer state returns to "FROZEN". Writing any new state to freezer.state while checkpointing will return EBUSY. These semantics ensure that userspace cannot unfreeze the cgroup midway through the checkpoint system call. The cgroup_freezer_begin_checkpoint() and cgroup_freezer_end_checkpoint() functions make relatively few assumptions about the task that is passed in. However, the way they are called in do_checkpoint() assumes that the root of the container is in the same freezer cgroup as all the other tasks that will be checkpointed. Notes: As a side effect this prevents multiple tasks from entering the CHECKPOINTING state simultaneously. All but one will get -EBUSY. 
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Cedric Le Goater <legoater@free.fr> --- Documentation/cgroups/freezer-subsystem.txt | 10 ++ include/linux/freezer.h | 8 ++ kernel/cgroup_freezer.c | 166 ++++++++++++++++++++------- 3 files changed, 142 insertions(+), 42 deletions(-) diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt index 41f37fe..92b68e6 100644 --- a/Documentation/cgroups/freezer-subsystem.txt +++ b/Documentation/cgroups/freezer-subsystem.txt @@ -100,3 +100,13 @@ things happens: and returns EINVAL) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. + +When the cgroup freezer is used to guard container checkpoint operations the +freezer.state may be "CHECKPOINTING". "CHECKPOINTING" can only be set on a +"FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING" +state, the cgroup may not leave until the checkpoint system call returns the +freezer state to "FROZEN". Writing any new state to freezer.state while +checkpointing will return EBUSY. These semantics ensure that userspace cannot +unfreeze the cgroup midway through the checkpoint system call. Note that, +unlike "FROZEN" and "FREEZING", there is no corresponding "CHECKPOINTED" +state. 
diff --git a/include/linux/freezer.h b/include/linux/freezer.h index da7e52b..3d32641 100644 --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -65,11 +65,19 @@ extern void cancel_freezing(struct task_struct *p); #ifdef CONFIG_CGROUP_FREEZER extern int cgroup_freezing_or_frozen(struct task_struct *task); +extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q); +extern int cgroup_freezer_begin_checkpoint(struct task_struct *task); +extern void cgroup_freezer_end_checkpoint(struct task_struct *task); #else /* !CONFIG_CGROUP_FREEZER */ static inline int cgroup_freezing_or_frozen(struct task_struct *task) { return 0; } +static inline int in_same_cgroup_freezer(struct task_struct *p, + struct task_struct *q) +{ + return 0; +} #endif /* !CONFIG_CGROUP_FREEZER */ /* diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index 2c44736..dd87010 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -25,6 +25,7 @@ enum freezer_state { CGROUP_THAWED = 0, CGROUP_FREEZING, CGROUP_FROZEN, + CGROUP_CHECKPOINTING, }; struct freezer { @@ -63,6 +64,44 @@ int cgroup_freezing_or_frozen(struct task_struct *task) return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN); } +/* Task is frozen or will freeze immediately when next it gets woken */ +static bool is_task_frozen_enough(struct task_struct *task) +{ + return frozen(task) || + (task_is_stopped_or_traced(task) && freezing(task)); +} + +/* + * caller must hold freezer->lock + */ +static void update_freezer_state(struct cgroup *cgroup, + struct freezer *freezer) +{ + struct cgroup_iter it; + struct task_struct *task; + unsigned int nfrozen = 0, ntotal = 0; + + cgroup_iter_start(cgroup, &it); + while ((task = cgroup_iter_next(cgroup, &it))) { + ntotal++; + if (is_task_frozen_enough(task)) + nfrozen++; + } + + /* + * Transition to FROZEN when no new tasks can be added ensures + * that we never exist in the FROZEN state while there are unfrozen + * tasks. 
+ */ + if (nfrozen == ntotal) + freezer->state = CGROUP_FROZEN; + else if (nfrozen > 0) + freezer->state = CGROUP_FREEZING; + else + freezer->state = CGROUP_THAWED; + cgroup_iter_end(cgroup, &it); +} + /* * cgroups_write_string() limits the size of freezer state strings to * CGROUP_LOCAL_BUFFER_SIZE @@ -71,6 +110,7 @@ static const char *freezer_state_strs[] = { "THAWED", "FREEZING", "FROZEN", + "CHECKPOINTING", }; /* @@ -78,9 +118,9 @@ static const char *freezer_state_strs[] = { * Transitions are caused by userspace writes to the freezer.state file. * The values in parenthesis are state labels. The rest are edge labels. * - * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN) - * ^ ^ | | - * | \_______THAWED_______/ | + * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN) --> (CHECKPOINTING) + * ^ ^ | | ^ | + * | \_______THAWED_______/ | \_____________/ * \__________________________THAWED____________/ */ @@ -153,13 +193,6 @@ static void freezer_destroy(struct cgroup_subsys *ss, kfree(cgroup_freezer(cgroup)); } -/* Task is frozen or will freeze immediately when next it gets woken */ -static bool is_task_frozen_enough(struct task_struct *task) -{ - return frozen(task) || - (task_is_stopped_or_traced(task) && freezing(task)); -} - /* * The call to cgroup_lock() in the freezer.state write method prevents * a write to that file racing against an attach, and hence the @@ -229,37 +262,6 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task) spin_unlock_irq(&freezer->lock); } -/* - * caller must hold freezer->lock - */ -static void update_freezer_state(struct cgroup *cgroup, - struct freezer *freezer) -{ - struct cgroup_iter it; - struct task_struct *task; - unsigned int nfrozen = 0, ntotal = 0; - - cgroup_iter_start(cgroup, &it); - while ((task = cgroup_iter_next(cgroup, &it))) { - ntotal++; - if (is_task_frozen_enough(task)) - nfrozen++; - } - - /* - * Transition to FROZEN when no new tasks can be added ensures - * that we never exist in the 
FROZEN state while there are unfrozen - * tasks. - */ - if (nfrozen == ntotal) - freezer->state = CGROUP_FROZEN; - else if (nfrozen > 0) - freezer->state = CGROUP_FREEZING; - else - freezer->state = CGROUP_THAWED; - cgroup_iter_end(cgroup, &it); -} - static int freezer_read(struct cgroup *cgroup, struct cftype *cft, struct seq_file *m) { @@ -330,7 +332,10 @@ static int freezer_change_state(struct cgroup *cgroup, freezer = cgroup_freezer(cgroup); spin_lock_irq(&freezer->lock); - + if (freezer->state == CGROUP_CHECKPOINTING) { + retval = -EBUSY; + goto out; + } update_freezer_state(cgroup, freezer); if (goal_state == freezer->state) goto out; @@ -398,3 +403,80 @@ struct cgroup_subsys freezer_subsys = { .fork = freezer_fork, .exit = NULL, }; + +#ifdef CONFIG_CHECKPOINT +/* + * Caller is expected to ensure that neither @p nor @q may change its + * freezer cgroup during this test in a way that may affect the result. + * E.g., when called form c/r, @p must be in CHECKPOINTING cgroup, so + * may not change cgroup, and either @q is also there, or is not there + * and may not join. + */ +int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q) +{ + struct cgroup_subsys_state *p_css, *q_css; + + task_lock(p); + p_css = task_subsys_state(p, freezer_subsys_id); + task_unlock(p); + + task_lock(q); + q_css = task_subsys_state(q, freezer_subsys_id); + task_unlock(q); + + return (p_css == q_css); +} + +/* + * cgroup freezer state changes made without the aid of the cgroup filesystem + * must go through this function to ensure proper locking is observed. 
+ */ +static int freezer_checkpointing(struct task_struct *task, + enum freezer_state next_state) +{ + struct freezer *freezer; + struct cgroup_subsys_state *css; + enum freezer_state state; + + task_lock(task); + css = task_subsys_state(task, freezer_subsys_id); + css_get(css); /* make sure freezer doesn't go away */ + freezer = container_of(css, struct freezer, css); + task_unlock(task); + + if (freezer->state == CGROUP_FREEZING) { + /* May be in middle of a lazy FREEZING -> FROZEN transition */ + if (cgroup_lock_live_group(css->cgroup)) { + spin_lock_irq(&freezer->lock); + update_freezer_state(css->cgroup, freezer); + spin_unlock_irq(&freezer->lock); + cgroup_unlock(); + } + } + + spin_lock_irq(&freezer->lock); + state = freezer->state; + if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) || + (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN)) + freezer->state = next_state; + spin_unlock_irq(&freezer->lock); + css_put(css); + return state; +} + +int cgroup_freezer_begin_checkpoint(struct task_struct *task) +{ + if (freezer_checkpointing(task, CGROUP_CHECKPOINTING) != CGROUP_FROZEN) + return -EBUSY; + return 0; +} + +void cgroup_freezer_end_checkpoint(struct task_struct *task) +{ + /* + * If we weren't in CHECKPOINTING state then userspace could have + * unfrozen a task and given us an inconsistent checkpoint image + */ + WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING); +} +#endif /* CONFIG_CHECKPOINT */ -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
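The transition rule that this patch adds can be modelled in plain userspace C. The sketch below is only an illustration under stated assumptions: every model_* name is hypothetical, locking and the lazy FREEZING -> FROZEN update are left out, and the freezer.state write path is reduced to the new -EBUSY check. Only the state comparison itself is taken from freezer_checkpointing() in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mirrors enum freezer_state as extended by the patch. */
enum freezer_state {
	CGROUP_THAWED = 0,
	CGROUP_FREEZING,
	CGROUP_FROZEN,
	CGROUP_CHECKPOINTING,
};

/* Hypothetical stand-in for struct freezer: no spinlock, no css. */
struct model_freezer {
	enum freezer_state state;
};

/*
 * Same rule as freezer_checkpointing(): only FROZEN -> CHECKPOINTING and
 * CHECKPOINTING -> FROZEN are allowed. Returns the state that was seen
 * before the (possible) transition.
 */
static enum freezer_state model_checkpointing(struct model_freezer *f,
					      enum freezer_state next)
{
	enum freezer_state old;

	assert(f != NULL);
	old = f->state;
	if ((old == CGROUP_FROZEN && next == CGROUP_CHECKPOINTING) ||
	    (old == CGROUP_CHECKPOINTING && next == CGROUP_FROZEN))
		f->state = next;
	return old;
}

/* Like cgroup_freezer_begin_checkpoint(): -EBUSY unless the group was FROZEN. */
static int model_begin_checkpoint(struct model_freezer *f)
{
	if (model_checkpointing(f, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
		return -EBUSY;
	return 0;
}

/* Like cgroup_freezer_end_checkpoint(), without the WARN_ON(). */
static void model_end_checkpoint(struct model_freezer *f)
{
	model_checkpointing(f, CGROUP_FROZEN);
}

/*
 * Simplified freezer.state write: the real freezer_change_state() does
 * much more; this keeps only the new "refuse while CHECKPOINTING" check.
 */
static int model_write_state(struct model_freezer *f, enum freezer_state goal)
{
	if (f->state == CGROUP_CHECKPOINTING)
		return -EBUSY;
	f->state = goal;
	return 0;
}
```

In this model a second model_begin_checkpoint() on a group that is already CHECKPOINTING also returns -EBUSY, which matches the note in the commit message that all but one concurrent checkpointer will fail.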
* [C/R v20][PATCH 18/96] cgroup freezer: interface to freeze a cgroup from within the kernel 2010-03-17 16:08 ` [C/R v20][PATCH 17/96] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 19/96] Namespaces submenu Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan, Matt Helsley, Paul Menage, Li Zefan, Cedric Le Goater Add a public interface to freeze a cgroup freezer given a task that belongs to that cgroup: cgroup_freezer_make_frozen(task) Freezing the root cgroup is not permitted. Freezing the cgroup to which the current process belongs is also not permitted. This will be used by restart(2) to be able to leave the restarted processes in a frozen state, instead of resuming execution. This is useful for debugging, if the user would like to attach a debugger to the restarted task(s). It is also useful if the restart procedure would like to perform additional setup once the tasks are restored but before they are allowed to resume execution. 
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> CC: Matt Helsley <matthltc@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Cedric Le Goater <legoater@free.fr> --- include/linux/freezer.h | 1 + kernel/cgroup_freezer.c | 27 +++++++++++++++++++++++++++ 2 files changed, 28 insertions(+), 0 deletions(-) diff --git a/include/linux/freezer.h b/include/linux/freezer.h index 3d32641..0cb22cb 100644 --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -68,6 +68,7 @@ extern int cgroup_freezing_or_frozen(struct task_struct *task); extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q); extern int cgroup_freezer_begin_checkpoint(struct task_struct *task); extern void cgroup_freezer_end_checkpoint(struct task_struct *task); +extern int cgroup_freezer_make_frozen(struct task_struct *task); #else /* !CONFIG_CGROUP_FREEZER */ static inline int cgroup_freezing_or_frozen(struct task_struct *task) { diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index dd87010..efd4597 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -479,4 +479,31 @@ void cgroup_freezer_end_checkpoint(struct task_struct *task) */ WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING); } + +int cgroup_freezer_make_frozen(struct task_struct *task) +{ + struct freezer *freezer; + struct cgroup_subsys_state *css; + int ret = -ENODEV; + + task_lock(task); + css = task_subsys_state(task, freezer_subsys_id); + css_get(css); /* make sure freezer doesn't go away */ + freezer = container_of(css, struct freezer, css); + task_unlock(task); + + /* Never freeze the root cgroup */ + if (!test_bit(CSS_ROOT, &css->flags) && + cgroup_lock_live_group(css->cgroup)) { + /* do not freeze outselves, ei ?! 
*/ + if (css != task_subsys_state(current, freezer_subsys_id)) + ret = freezer_change_state(css->cgroup, CGROUP_FROZEN); + else + ret = -EPERM; + cgroup_unlock(); + } + + css_put(css); + return ret; +} #endif /* CONFIG_CHECKPOINT */ -- 1.6.3.3
* [C/R v20][PATCH 19/96] Namespaces submenu 2010-03-17 16:08 ` [C/R v20][PATCH 18/96] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Dave Hansen From: Dave Hansen <dave@linux.vnet.ibm.com> Let's not steal too much space in the 'General Setup' menu. Take a cue from the cgroups code and create a submenu. This can go upstream now. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> --- init/Kconfig | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index d95ca7c..0c00a78 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -668,7 +668,7 @@ config RELAY If unsure, say N. -config NAMESPACES +menuconfig NAMESPACES bool "Namespaces support" if EMBEDDED default !EMBEDDED help -- 1.6.3.3
* [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public 2010-03-17 16:08 ` [C/R v20][PATCH 19/96] Namespaces submenu Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 21/96] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan These two are used in the next patch when calling vfs_read/write() Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> --- fs/read_write.c | 10 ---------- include/linux/fs.h | 10 ++++++++++ 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index b7f4a1f..e258301 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ EXPORT_SYMBOL(vfs_write); -static inline loff_t file_pos_read(struct file *file) -{ - return file->f_pos; -} - -static inline void file_pos_write(struct file *file, loff_t pos) -{ - file->f_pos = pos; -} - SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) { struct file *file; diff --git a/include/linux/fs.h b/include/linux/fs.h index ebb1cd5..6c08df2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, struct iovec *fast_pointer, struct iovec **ret_pointer); +static inline loff_t file_pos_read(struct file *file) +{ + return file->f_pos; +} + +static inline void file_pos_write(struct file *file, loff_t pos) +{ + file->f_pos = pos; +} + extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t vfs_readv(struct file *, const struct iovec __user *, -- 1.6.3.3 
* [C/R v20][PATCH 21/96] c/r: create syscalls: sys_checkpoint, sys_restart 2010-03-17 16:08 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 22/96] c/r: documentation Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan, Dave Hansen Create trivial sys_checkpoint and sys_restart system calls. They will make it possible to checkpoint and restart an entire container, to and from a checkpoint image file descriptor. The syscalls take a pid, a file descriptor (for the image file), flags and, since v19-rc1, a log file descriptor (logfd) as arguments. The pid identifies the top-most (root) task in the process tree, e.g. the container init: for sys_checkpoint the first argument identifies the pid of the target container/subtree; for sys_restart it will identify the pid of the restarting root task. A checkpoint, much like a process coredump, dumps the state of multiple processes at once, including the state of the container. The checkpoint image is written to (and read from) the file descriptor directly from the kernel. This way the data is generated and then pushed out naturally as resources and tasks are scanned to save their state. This is the approach taken by, e.g., Zap and OpenVZ. By using a return value and not a file descriptor, we can distinguish between a return from checkpoint, a return from restart (in case of a checkpoint that includes self, i.e. a task checkpointing its own container, or itself), and an error condition, in a manner analogous to a fork() call. We don't use copy_from_user()/copy_to_user() because it requires holding the entire image in user space, and does not make sense for restart. 
Also, we don't use a pipe, pseudo-fs file and the like, because they work by generating data on demand as the user pulls it (unless the entire image is buffered in the kernel) and would require more complex logic. They also would significantly complicate checkpoint that includes self. Changelog[v19-rc1]: - Add 'int logfd' to prototype of sys_{checkpoint,restart} Changelog[v18]: - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile Changelog[v17]: - Move checkpoint closer to namespaces (kconfig) - Kill "Enable" in c/r config option Changelog[v16]: - Change sys_restart() first argument to be 'pid_t pid' Changelog[v14]: - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo) - Remove line 'def_bool n' (default is already 'n') - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch) Changelog[v5]: - Config is 'def_bool n' by default Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- Makefile | 2 +- arch/x86/Kconfig | 4 +++ arch/x86/include/asm/unistd_32.h | 4 ++- arch/x86/kernel/syscall_table_32.S | 2 + checkpoint/Kconfig | 14 +++++++++++ checkpoint/Makefile | 5 ++++ checkpoint/sys.c | 45 ++++++++++++++++++++++++++++++++++++ include/linux/syscalls.h | 4 +++ init/Kconfig | 2 + kernel/sys_ni.c | 4 +++ 10 files changed, 84 insertions(+), 2 deletions(-) create mode 100644 checkpoint/Kconfig create mode 100644 checkpoint/Makefile create mode 100644 checkpoint/sys.c diff --git a/Makefile b/Makefile index 1b24895..1452de1 100644 --- a/Makefile +++ b/Makefile @@ -409,7 +409,7 @@ endif # of make so .config is not included in this case either (for *config). 
no-dot-config-targets := clean mrproper distclean \ - cscope TAGS tags help %docs check% \ + cscope TAGS tags help %docs checkstack \ include/linux/version.h headers_% \ kernelrelease kernelversion diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index eb40925..d5a7284 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -91,6 +91,10 @@ config STACKTRACE_SUPPORT config HAVE_LATENCYTOP_SUPPORT def_bool y +config CHECKPOINT_SUPPORT + bool + default y if X86_32 + config MMU def_bool y diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index cd7ca6a..55b7cae 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -344,10 +344,12 @@ #define __NR_perf_event_open 336 #define __NR_recvmmsg 337 #define __NR_eclone 338 +#define __NR_checkpoint 339 +#define __NR_restart 340 #ifdef __KERNEL__ -#define NR_syscalls 339 +#define NR_syscalls 341 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index 22ae7ef..899a4f1 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -338,3 +338,5 @@ ENTRY(sys_call_table) .long sys_perf_event_open .long sys_recvmmsg .long ptregs_eclone + .long sys_checkpoint + .long sys_restart /* 340 */ diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig new file mode 100644 index 0000000..ef7d406 --- /dev/null +++ b/checkpoint/Kconfig @@ -0,0 +1,14 @@ +# Architectures should define CHECKPOINT_SUPPORT when they have +# implemented the hooks for processor state etc. needed by the +# core checkpoint/restart code. + +config CHECKPOINT + bool "Checkpoint/restart (EXPERIMENTAL)" + depends on CHECKPOINT_SUPPORT && EXPERIMENTAL + help + Application checkpoint/restart is the ability to save the + state of a running application so that it can later resume + its execution from the time at which it was checkpointed. 
+ + Turning this option on will enable checkpoint and restart + functionality in the kernel. diff --git a/checkpoint/Makefile b/checkpoint/Makefile new file mode 100644 index 0000000..8a32c6f --- /dev/null +++ b/checkpoint/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for linux checkpoint/restart. +# + +obj-$(CONFIG_CHECKPOINT) += sys.o diff --git a/checkpoint/sys.c b/checkpoint/sys.c new file mode 100644 index 0000000..a81750a --- /dev/null +++ b/checkpoint/sys.c @@ -0,0 +1,45 @@ +/* + * Generic container checkpoint-restart + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/sched.h> +#include <linux/kernel.h> +#include <linux/syscalls.h> + +/** + * sys_checkpoint - checkpoint a container + * @pid: pid of the container init(1) process + * @fd: file to which dump the checkpoint image + * @flags: checkpoint operation flags + * @logfd: fd to which to dump debug and error messages + * + * Returns positive identifier on success, 0 when returning from restart + * or negative value on error + */ +SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, + unsigned long, flags, int, logfd) +{ + return -ENOSYS; +} + +/** + * sys_restart - restart a container + * @pid: pid of task root (in coordinator's namespace), or 0 + * @fd: file from which read the checkpoint image + * @flags: restart operation flags + * @logfd: fd to which to dump debug and error messages + * + * Returns negative value on error, or otherwise returns in the realm + * of the original checkpoint + */ +SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, + unsigned long, flags, int, logfd) +{ + return -ENOSYS; +} diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 207466a..3d80ac0 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -825,6 +825,10 @@ asmlinkage long sys_pselect6(int, fd_set 
__user *, fd_set __user *, asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int, struct timespec __user *, const sigset_t __user *, size_t); +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags, + int logfd); +asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags, + int logfd); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); diff --git a/init/Kconfig b/init/Kconfig index 0c00a78..4640375 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -719,6 +719,8 @@ config NET_NS Allow user space to create what appear to be multiple instances of the network stack. +source "checkpoint/Kconfig" + config BLK_DEV_INITRD bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support" depends on BROKEN || !FRV diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 695384f..9c6fab7 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -180,3 +180,7 @@ cond_syscall(sys_eventfd2); /* performance counters: */ cond_syscall(sys_perf_event_open); + +/* checkpoint/restart */ +cond_syscall(sys_checkpoint); +cond_syscall(sys_restart); -- 1.6.3.3
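The fork()-like return convention described in this patch can be sketched from the caller's side. This is a minimal userspace illustration, not code from the series: the enum, classify_checkpoint_ret() and outcome_name() are hypothetical names, and the wrapper only issues the call when the installed headers define __NR_checkpoint (headers from a tree without this patch will not, in which case the wrapper reports ENOSYS, just as the stub in checkpoint/sys.c does). Note also that syscall(2) reports failure as -1 with errno set rather than as a negative errno value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical labels for the three kinds of return described above. */
enum ckpt_outcome {
	CKPT_ERROR,	/* negative: the call failed */
	CKPT_RESTARTED,	/* zero: we are returning from a restart */
	CKPT_TAKEN,	/* positive: checkpoint identifier */
};

/* Four-argument wrapper matching the prototype added to syscalls.h. */
static long checkpoint(pid_t pid, int fd, unsigned long flags, int logfd)
{
#ifdef __NR_checkpoint
	return syscall(__NR_checkpoint, pid, fd, flags, logfd);
#else
	/* Headers without the patch: behave like an unimplemented syscall. */
	(void)pid;
	(void)fd;
	(void)flags;
	(void)logfd;
	errno = ENOSYS;
	return -1;
#endif
}

/* Map a return value onto the convention documented for sys_checkpoint(). */
static enum ckpt_outcome classify_checkpoint_ret(long ret)
{
	if (ret < 0)
		return CKPT_ERROR;
	if (ret == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}

/* Human-readable name, e.g. for a log line. */
static const char *outcome_name(enum ckpt_outcome o)
{
	switch (o) {
	case CKPT_ERROR:
		return "error";
	case CKPT_RESTARTED:
		return "returned from restart";
	case CKPT_TAKEN:
		return "checkpoint taken";
	}
	assert(!"unknown outcome");
	return "";
}
```

A caller would pass the result of checkpoint() to classify_checkpoint_ret() and branch on it much as it would on the result of fork(); which value of logfd, if any, disables logging is not stated in this patch, so the sketch leaves that choice to the caller.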
* [C/R v20][PATCH 22/96] c/r: documentation 2010-03-17 16:08 ` [C/R v20][PATCH 21/96] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 23/96] c/r: basic infrastructure for checkpoint/restart Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan, Dave Hansen Covers application checkpoint/restart, overall design, interfaces, usage, shared objects, and checkpoint image format. Changelog[v19-rc1]: - Update documentation and examples for new syscalls API - [Liu Alexander] Fix typos - [Serge Hallyn] Update checkpoint image format Changelog[v16]: - Update documentation - Unify into readme.txt and usage.txt Changelog[v14]: - Discard the 'h.parent' field - New image format (shared objects appear before they are referenced unless they are compound) Changelog[v8]: - Split into multiple files in Documentation/checkpoint/... - Extend documentation, fix typos and comments from feedback Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. 
Hallyn <serue@us.ibm.com> --- Documentation/checkpoint/checkpoint.c | 38 +++ Documentation/checkpoint/readme.txt | 370 ++++++++++++++++++++++++++++ Documentation/checkpoint/self_checkpoint.c | 69 +++++ Documentation/checkpoint/self_restart.c | 40 +++ Documentation/checkpoint/usage.txt | 247 +++++++++++++++++++ 5 files changed, 764 insertions(+), 0 deletions(-) create mode 100644 Documentation/checkpoint/checkpoint.c create mode 100644 Documentation/checkpoint/readme.txt create mode 100644 Documentation/checkpoint/self_checkpoint.c create mode 100644 Documentation/checkpoint/self_restart.c create mode 100644 Documentation/checkpoint/usage.txt diff --git a/Documentation/checkpoint/checkpoint.c b/Documentation/checkpoint/checkpoint.c new file mode 100644 index 0000000..8560f30 --- /dev/null +++ b/Documentation/checkpoint/checkpoint.c @@ -0,0 +1,38 @@ +#include <stdio.h> +#include <stdlib.h> +#include <errno.h> +#include <unistd.h> +#include <sys/syscall.h> + +#include <linux/checkpoint.h> + +static inline int checkpoint(pid_t pid, int fd, unsigned long flags) +{ + return syscall(__NR_checkpoint, pid, fd, flags); +} + +int main(int argc, char *argv[]) +{ + pid_t pid; + int ret; + + if (argc != 2) { + printf("usage: ckpt PID\n"); + exit(1); + } + + pid = atoi(argv[1]); + if (pid <= 0) { + printf("invalid pid\n"); + exit(1); + } + + ret = checkpoint(pid, STDOUT_FILENO, CHECKPOINT_SUBTREE); + + if (ret < 0) + perror("checkpoint"); + else + printf("checkpoint id %d\n", ret); + + return (ret > 0 ? 
0 : 1); +} diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt new file mode 100644 index 0000000..4fa5560 --- /dev/null +++ b/Documentation/checkpoint/readme.txt @@ -0,0 +1,370 @@ + + Checkpoint-Restart support in the Linux kernel + ========================================================== + +Copyright (C) 2008-2010 Oren Laadan + +Author: Oren Laadan <orenl@cs.columbia.edu> + +License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) + +Contributors: Oren Laadan <orenl@cs.columbia.edu> + Serge Hallyn <serue@us.ibm.com> + Dan Smith <danms@us.ibm.com> + Matt Helsley <matthltc@us.ibm.com> + Nathan Lynch <ntl@pobox.com> + Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> + Dave Hansen <dave@linux.vnet.ibm.com> + + +Introduction +============ + +Application checkpoint/restart [C/R] is the ability to save the state +of a running application so that it can later resume its execution +from the time at which it was checkpointed. An application can be +migrated by checkpointing it on one machine and restarting it on +another. C/R can provide many potential benefits: + +* Failure recovery: by rolling back to a previous checkpoint + +* Improved response time: by restarting applications from checkpoints + instead of from scratch. + +* Improved system utilization: by suspending long running CPU + intensive jobs and resuming them when load decreases. + +* Fault resilience: by migrating applications off faulty hosts. + +* Dynamic load balancing: by migrating applications to less loaded + hosts. + +* Improved service availability and administration: by migrating + applications before host maintenance so that they continue to run + with minimal downtime + +* Time-travel: by taking periodic checkpoints and restarting from + any previous checkpoint. 
+ +Compared to hypervisor approaches, application C/R is more lightweight +since it need only save the state associated with applications, while +operating system data structures (e.g. buffer cache, drivers state +and the like) are uninteresting. + + +Overall design +============== + +Checkpoint and restart are done in the kernel as much as possible. +Two new system calls are introduced to provide C/R: sys_checkpoint() +and sys_restart(). They both operate on a process tree (hierarchy), +either a whole container or a subtree of a container. + +Checkpointing entire containers ensures that there are no dependencies +on anything outside the container, which guarantees that a matching +restart will succeed (assuming that the file system state remains +consistent). However, it requires that users will always run the tasks +that they wish to checkpoint inside containers. This is ideal for, +e.g., private virtual servers and the like. + +In contrast, when checkpointing a subtree of a container it is up to +the user to ensure that dependencies either don't exist or can be +safely ignored. This is useful, for instance, for HPC scenarios or +even a user that would like to periodically checkpoint a long-running +batch job. + +An additional system call, a la madvise(), is planned, so that tasks +can advise the kernel how to handle specific resources. For instance, +a task could ask to skip a memory area at checkpoint to save space, +or to use a preset file descriptor at restart instead of restoring it +from the checkpoint image. It will provide the flexibility that is +particularly useful to address the needs of a diverse crowd of users +and use-cases. + +Syscall sys_checkpoint() is given a pid that indicates the top of the +hierarchy, a file descriptor to store the image, and flags. The code +serializes internal user- and kernel-state and writes it out to the +file descriptor. The resulting image is stream-able. 
The processes are +expected to be frozen for the duration of the checkpoint. + +In general, a checkpoint consists of 5 steps: +1. Pre-dump +2. Freeze the container/subtree +3. Save tasks' and kernel state <-- sys_checkpoint() +4. Thaw (or kill) the container/subtree +5. Post-dump + +Step 3 is done by calling sys_checkpoint(). Steps 1 and 5 are +optimizations to reduce application downtime. In particular, "pre-dump" +works before freezing the container, e.g. the pre-copy for live +migration, and "post-dump" works after the container resumes +execution, e.g. writing the data back to secondary storage. + +The kernel exports a relatively opaque 'blob' of data to userspace +which can then be handed to the new kernel at restart time. The +'blob' contains data and state of select portions of kernel structures +such as VMAs and mm_structs, as well as copies of the actual memory +that the tasks use. Any changes in this blob's format between kernel +revisions can be handled by an in-userspace conversion program. + +To restart, userspace first creates a process hierarchy that matches +that of the checkpoint, and each task calls sys_restart(). The syscall +reads the saved kernel state from a file descriptor, and re-creates +the resources that the tasks need to resume execution. The restart +code is executed by each task that is restored in the new hierarchy to +reconstruct its own state. + +In general, a restart consists of 3 steps: +1. Create hierarchy +2. Restore tasks' and kernel state <-- sys_restart() +3. Resume userspace (or freeze tasks) + +Because the process hierarchy is created in userspace during restart, +the restarting tasks have the flexibility to prepare before calling +sys_restart(). + + +Checkpoint image format +======================= + +The checkpoint image format is built of records, each consisting of a +pre-header identifying its contents, followed by a payload. 
+This format allows userspace tools to easily parse and skip through
+the image without requiring intimate knowledge of the data. It will
+also be handy to enable parallel checkpointing in the future where
+multiple threads interleave data from multiple processes into a single
+stream.
+
+The pre-header is defined by 'struct ckpt_hdr' as follows: @type
+identifies the type of the payload, @len tells its length in bytes
+including the pre-header.
+
+struct ckpt_hdr {
+	__s32 type;
+	__s32 len;
+};
+
+The pre-header must be the first component in all other headers. For
+instance, the task data is saved in 'struct ckpt_hdr_task', which
+looks something like this:
+
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 pid;
+	...
+};
+
+THE IMAGE FORMAT IS EXPECTED TO CHANGE over time as more features are
+supported, or as existing features change in the kernel and require
+their representation to be adjusted. Any such changes will be handled
+by in-userspace conversion tools.
+
+The general format of the checkpoint image is as follows:
+* Image header
+* Container configuration
+* Task hierarchy
+* Tasks' state
+* Image trailer
+
+The image always begins with a general header that holds a magic
+number, an architecture identifier (little endian format), a format
+version number (@rev), followed by information about the kernel
+(currently version and UTS data). It also holds the time of the
+checkpoint and the flags given to sys_checkpoint(). This header is
+followed by an arch-specific header.
+
+The container configuration section contains information that is
+global to the container. Security (LSM) configuration is one example.
+Network configuration and container-wide mounts may also go here, so
+that the userspace restart coordinator can re-create a suitable
+environment.
+
+The task hierarchy comes next so that userspace tools can read it
+early (even from a stream) and re-create the restarting tasks.
+This is basically an array of all checkpointed tasks, and their
+relationships (parent, siblings, threads, etc).
+
+Then the state of all tasks is saved, in the order that they appear in
+the tasks array above. For each task, we save data like task_struct,
+namespaces, open files, memory layout, memory contents, cpu state,
+signals and signal handlers, etc. For resources that are shared among
+multiple processes, we first checkpoint said resource (and only once),
+and in the task data we give a reference to it. More about shared
+resources below.
+
+Finally, the image always ends with a trailer that holds a (different)
+magic number, serving as a sanity check.
+
+
+Shared objects
+==============
+
+Many resources may be shared by multiple tasks (e.g. file descriptors,
+memory address space, etc), or even have multiple references from
+other resources (e.g. a single inode that represents two ends of a
+pipe).
+
+Shared objects are tracked using a hash table (objhash) to ensure that
+they are only checkpointed or restored once. To handle a shared
+object, it is first looked up in the hash table, to determine if it is
+the first encounter or a recurring appearance. The hash table itself
+is not saved as part of the checkpoint image: it is constructed
+dynamically during both checkpoint and restart, and discarded at the
+end of the operation.
+
+During checkpoint, when a shared object is encountered for the first
+time, it is inserted into the hash table, indexed by its kernel address.
+It is assigned an identifier (@objref) in order of appearance, and
+then its state is saved. Subsequent lookups of that object in the hash
+will yield that entry, in which case only the @objref is saved, as
+opposed to the entire state of the object.
+
+During restart, shared objects are indexed by their @objref as given
+during the checkpoint. On the first appearance of each shared object,
+a new resource will be created and its state restored from the image.
+Then the object is added to the hash table. Subsequent lookups of the
+same unique identifier in the hash table will yield that entry, and
+then the existing object instance is reused instead of creating
+a new one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+Shared objects are thus saved when they are first seen, and _before_
+the parent object that uses them. Therefore by the time the parent
+object needs them, they should already be in the objhash. The one
+exception is when more than a single shared resource will be restarted
+at once (e.g. the two ends of a pipe, or all the namespaces in an
+nsproxy). In this case the parent object is dumped first, followed by
+the individual sub-resources.
+
+The checkpoint image is stream-able, meaning that restarting from it
+does not require lseek(). This is enforced at checkpoint time, by
+carefully selecting the order of shared objects, to respect the rule
+that an object is always saved before the objects that refer to it.
+
+
+Memory contents format
+======================
+
+The memory contents of a given memory address space (->mm) are dumped
+as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
+This header details the vma properties, and a reference to a file
+(if file backed) or an inode (or shared memory) object.
+
+The vma header is followed by the actual contents - but only those
+pages that need to be saved, i.e. dirty pages. They are written in
+chunks of data, where each chunk contains a header that indicates
+the number of pages in the chunk, followed by an array of virtual
+addresses and then an array of actual page contents.
+The last chunk holds zero pages.
+
+To illustrate this, consider a single simple task with two vmas: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The memory dump will look like this:
+
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+		ckpt_hdr_pgarr (nr_pages = 0)
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+		ckpt_hdr_pgarr (nr_pages = 0)
+
+
+Error handling
+==============
+
+Both checkpoint and restart operations may fail due to a variety of
+reasons. Using a simple, single return value from the system call is
+insufficient to report the reason for a failure.
+
+Instead, both sys_checkpoint() and sys_restart() accept an additional
+argument - a file descriptor to which the kernel writes diagnostic
+and debugging information. Both the checkpoint and restart userspace
+utilities have options to specify a filename to store this log.
+
+In addition, checkpoint provides an informative status report upon
+failure in the checkpoint image in the form of (one or more) error
+objects, 'struct ckpt_hdr_err'. An error object consists of a
+mandatory pre-header followed by a null character ('\0'), and then a
+string that describes the error. By default, if an error occurs, this
+will be the last object written to the checkpoint image.
+
+Upon failure, the caller can examine the image (e.g. with 'ckptinfo')
+and extract the detailed error message. The leading '\0' is useful if
+one wants to seek back from the end of the checkpoint image, instead
+of parsing the entire image separately.
+
+
+Security
+========
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it.
+Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+access mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image. As credentials are restored too,
+the ability of a task that calls sys_restart() to setresuid/setresgid
+to those values must be checked.
+
+Keeping the restart procedure within the limits of the caller's
+credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+However, this can be controlled with a sysctl-variable.
+
+
+Kernel interfaces
+=================
+
+* To checkpoint a vma, the 'struct vm_operations_struct' needs to
+  provide a method ->checkpoint:
+    int checkpoint(struct ckpt_ctx *, struct vm_area_struct *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
+
+* To checkpoint a file, the 'struct file_operations' needs to provide
+  the methods ->checkpoint and ->collect:
+    int checkpoint(struct ckpt_ctx *, struct file *)
+    int collect(struct ckpt_ctx *, struct file *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
+  For most file systems, generic_file_{checkpoint,restore}() can be
+  used.
+
+* To checkpoint a socket, the 'struct proto_ops' needs to provide
+  the methods ->checkpoint, ->collect and ->restore:
+    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+    int collect(struct ckpt_ctx *ctx, struct socket *sock);
+    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
+
diff --git a/Documentation/checkpoint/self_checkpoint.c b/Documentation/checkpoint/self_checkpoint.c
new file mode 100644
index 0000000..27dba0d
--- /dev/null
+++ b/Documentation/checkpoint/self_checkpoint.c
@@ -0,0 +1,69 @@
+/*
+ * self_checkpoint.c: demonstrate self-checkpoint
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#include <linux/checkpoint.h>
+
+static inline int checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	return syscall(__NR_checkpoint, pid, fd, flags, CHECKPOINT_FD_NONE);
+}
+
+#define OUTFILE "/tmp/cr-self.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	fprintf(file, "hello, world!\n");
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		fprintf(file, "count %d\n", i);
+		fflush(file);
+
+		if (i != 2)
+			continue;
+		ret = checkpoint(pid, STDOUT_FILENO, CHECKPOINT_SUBTREE);
+		if (ret < 0) {
+			fprintf(file, "ckpt: %s\n", strerror(errno));
+			exit(2);
+		}
+
+		fprintf(file, "checkpoint ret: %d\n", ret);
+		fflush(file);
+	}
+
+	return 0;
+}
diff --git a/Documentation/checkpoint/self_restart.c b/Documentation/checkpoint/self_restart.c
new file mode 100644
index 0000000..647ce51
--- /dev/null
+++ b/Documentation/checkpoint/self_restart.c
@@ -0,0 +1,40 @@
+/*
+ * self_restart.c: demonstrate self-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+#include <linux/checkpoint.h>
+
+static inline int restart(pid_t pid, int fd, unsigned long flags)
+{
+	return syscall(__NR_restart, pid, fd, flags, CHECKPOINT_FD_NONE);
+}
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = restart(pid, STDIN_FILENO, RESTART_TASKSELF);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..c6fc045
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,247 @@
+
+	      How to use Checkpoint-Restart
+	=========================================
+
+
+API
+===
+
+The API consists of three new system calls:
+
+* long checkpoint(pid_t pid, int fd, unsigned long flags, int logfd);
+
+  Checkpoint a (sub-)container whose root task is identified by @pid,
+  to the open file indicated by @fd. If @logfd isn't -1, it indicates
+  an open file to which error and debug messages are written. @flags
+  may be one or more of:
+    - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+  (other values are not allowed).
+
+  Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
+  it returns from a restart, and -1 if an error occurs. The ckptid will
+  uniquely identify a checkpoint image, for as long as the checkpoint
+  is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
+  partial checkpoint, residing in kernel memory).
+
+* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
+
+  Restart a process hierarchy from a checkpoint image that is read from
+  the blob stored in the file indicated by @fd. If @logfd isn't -1, it
+  indicates an open file to which error and debug messages are written.
+  @pid indicates the root of the hierarchy as seen in the coordinator's
+  pid-namespace, and is expected to be a child of the coordinator.
+  @flags may be one or more of:
+    - RESTART_TASKSELF : (self) restart of a single process
+    - RESTART_FROZEN : processes remain frozen once restart completes
+    - RESTART_GHOST : process is a ghost (placeholder for a pid)
+  (Note that @pid may mean 'ckptid' to identify an in-kernel
+  checkpoint image, with some @flags in the future).
+
+  Returns: -1 if an error occurs, 0 on success when restarting from a
+  "self" checkpoint, and the return value of the system call at the
+  time of the checkpoint when restarting from an "external" checkpoint.
+
+  (If a process was frozen for checkpoint while in userspace, it will
+  resume running in userspace exactly where it was interrupted. If it
+  was frozen while in kernel doing a syscall, it will return what the
+  syscall returned when interrupted/completed, and proceed from there
+  as if it had only been frozen and then thawed. Finally, if it did a
+  self-checkpoint, it will resume to the first instruction after the
+  call to checkpoint(2), having returned 0 to indicate that the
+  return is from a restart rather than from the checkpoint).
+
+* int clone_with_pids(unsigned long clone_flags, void *news,
+			int *parent_tidptr, int *child_tidptr,
+			struct target_pid_set *pid_set)
+
+  struct target_pid_set {
+	int num_pids;
+	pid_t *target_pids;
+  };
+
+  Container restart requires that a task have the same pid it had when
+  it was checkpointed. When containers are nested the tasks within the
+  containers exist in multiple pid namespaces and hence have multiple
+  pids to specify during restart.
+
+  clone_with_pids(), intended for use during restart, is similar to
+  clone(), except that it takes a 'target_pid_set' parameter.
+  This parameter lets the caller choose specific pid numbers for the
+  child process, in the process's active and ancestor pid namespaces.
+
+  Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for
+  now, to prevent unprivileged processes from misusing this interface.
+
+  If a target-pid is 0, the kernel continues to assign a pid for the
+  process in that namespace. If a requested pid is taken, the system
+  call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current
+  nesting level of pid namespaces, the system call fails with -EINVAL.
+
+
+Sysctl/proc
+===========
+
+/proc/sys/kernel/ckpt_unpriv_allowed [default = 1]
+  controls whether c/r operation is allowed for unprivileged users
+
+
+Operation
+=========
+
+The granularity of a checkpoint usually is a process hierarchy. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
+which does not refer to a container's init task, then sys_checkpoint()
+returns -EINVAL.
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases, if
+there are other tasks possibly sharing state with the container, they
+must not modify it during the operation. It is the responsibility of
+the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g.
+not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+User tools
+==========
+
+* checkpoint(1): a tool to perform a checkpoint of a container/subtree
+* restart(1): a tool to restart a container/subtree
+* ckptinfo: a tool to examine a checkpoint image
+
+It is best to use the dedicated user tools for checkpoint and restart.
+
+If you insist, then here is a code snippet that illustrates how a
+checkpoint is initiated by a process inside a container - the logic is
+similar to fork():
+	...
+	ckptid = checkpoint(0, ...);
+	switch (ckptid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ckptid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(pid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships of
+the task with other tasks, or any shared resources. It is useful for
+applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+
+You may find the following sample programs useful:
+
+* checkpoint.c: accepts a 'pid' and checkpoints that task to stdout
+* self_checkpoint.c: a simple test program doing self-checkpoint
+* self_restart.c: restarts a (self-) checkpoint image from stdin
+
+See also the utilities 'checkpoint' and 'restart' (from user-cr).
+
+
+"External" checkpoint
+=====================
+
+To do "external" checkpoint, you first need to freeze that other task
+using the freezer cgroup.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ echo 3493 > /cgroup/0/tasks
+	$ echo FROZEN > /cgroup/0/freezer.state
+	$ ./checkpoint 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ echo THAWED > /cgroup/0/freezer.state
+
+	$ ./self_restart < ckpt.image
+Now compare the output of the two output files.
+
+
+"Self" checkpoint
+=================
+
+To do self-checkpoint, you can incorporate the code from
+self_checkpoint.c into your application.
+
+Here is how to test the self-checkpoint:
+	$ ./self_checkpoint > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-self.out /tmp/cr-self.out.orig
+	$ cp /tmp/cr-self.out.orig /tmp/cr-self.out
+
+	$ cat /tmp/cr-self.out
+	hello, world!
+	count 0
+	count 1
+	count 2
+	checkpoint ret: 1
+	count 3
+	...
+
+	$ sed -i 's/count/xxxxx/g' /tmp/cr-self.out
+
+	$ ./self_restart < self.image &
+
+Now compare the output of the two output files.
+	$ cat /tmp/cr-self.out
+	hello, world!
+	xxxxx 0
+	xxxxx 1
+	xxxxx 2
+	checkpoint ret: 0
+	count 3
+	...
+
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "self_restart" changed
+its name to "test" or "self_checkpoint", as expected.
--
1.6.3.3
* [C/R v20][PATCH 23/96] c/r: basic infrastructure for checkpoint/restart
2010-03-17 16:08 ` [C/R v20][PATCH 22/96] c/r: documentation Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 24/96] c/r: x86_32 support " Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  c/r context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
checkpoint/process.c - c/r of task data

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Per-architecture support, as well as the actual work to do the memory
checkpoint, follows in subsequent patches.
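
To illustrate the record framing that these helpers manage, here is a
minimal userspace sketch in C; it is not part of this patch. It assumes
only what the documentation and ckpt_write_obj_type() state: every
record starts with a pre-header of two 32-bit fields (type, len), and
len counts the pre-header plus the payload. The names ckpt_hdr_sketch,
ckpt_sketch_put_record() and ckpt_sketch_count_records() are
hypothetical, and host byte order is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the pre-header: type, then total length. */
struct ckpt_hdr_sketch {
	int32_t type;
	int32_t len;	/* bytes, including this pre-header */
};

/*
 * Append one record (pre-header + payload) at buf.
 * Returns the number of bytes written, or 0 if it does not fit.
 */
static size_t ckpt_sketch_put_record(unsigned char *buf, size_t room,
				     int32_t type, const void *payload,
				     size_t payload_len)
{
	struct ckpt_hdr_sketch h;
	size_t total;

	if (payload_len > room ||
	    payload_len > (size_t)INT32_MAX - sizeof(h))
		return 0;
	total = sizeof(h) + payload_len;
	if (total > room)
		return 0;

	h.type = type;
	h.len = (int32_t)total;
	memcpy(buf, &h, sizeof(h));
	if (payload_len)
		memcpy(buf + sizeof(h), payload, payload_len);
	return total;
}

/*
 * Walk back-to-back records without interpreting any payload.
 * Returns the number of records, or -1 if one is truncated or
 * carries an impossible length.
 */
static int ckpt_sketch_count_records(const unsigned char *buf, size_t size)
{
	size_t pos = 0;
	int count = 0;

	while (pos < size) {
		struct ckpt_hdr_sketch h;

		if (size - pos < sizeof(h))
			return -1;	/* truncated pre-header */
		memcpy(&h, buf + pos, sizeof(h));	/* no alignment assumed */

		if (h.len < (int32_t)sizeof(h) || (size_t)h.len > size - pos)
			return -1;	/* bad or truncated length */

		pos += (size_t)h.len;	/* skip the payload unseen */
		count++;
	}
	return count;
}
```

Because len covers the whole record, a tool written this way can step
over record types it does not understand, which is the property the
documentation relies on for userspace parsing and conversion tools.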

Changelog[v20]:
  - Export key symbols to enable c/r from kernel modules
Changelog[v19]:
  - [Serge Hallyn] Use ckpt_err() for bad header values
Changelog[v19-rc3]:
  - sys_{checkpoint,restart} to use ptregs prototype
Changelog[v19-rc1]:
  - Set ctx->errno in do_ckpt_msg() if needed
  - Document prototype of ckpt_write_err in header
  - Update prototype of ckpt_read_obj()
  - Fix up headers so we can munge them for use by userspace
  - [Matt Helsley] Check for empty string for _ckpt_write_err()
  - [Matt Helsley] Add cpp definitions for enums
  - [Serge Hallyn] Add global section container to image format
  - [Matt Helsley] Fix total byte read/write count for large images
  - ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
  - [Serge Hallyn] Define new api for error and debug logging
  - Use logfd in sys_{checkpoint,restart}
Changelog[v18]:
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Matt Helsley] Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Save/restore t->{set,clear}_child_tid
  - Restart(2) isn't idempotent: must return -EINTR if interrupted
  - ckpt_debug does not depend on DYNAMIC_DEBUG, on by default
  - Export generic checkpoint headers to userspace
  - Fix comment for prototype of sys_restart
  - Have ckpt_debug() print global-pid and __LINE__
  - Only save and test kernel constants once (in header)
Changelog[v16]:
  - Split ctx->flags to ->uflags (user flags) and ->kflags (kernel flags)
  - Introduce __ckpt_write_err() and ckpt_write_err() to report errors
  - Allow @ptr == NULL to write (or read) header only without payload
  - Introduce _ckpt_read_obj_type()
Changelog[v15]:
  - Replace header buffer in ckpt_ctx (hbuf,hpos) with kmalloc/kfree()
Changelog[v14]:
  - Cleanup interface to get/put hdr buffers
  - Merge checkpoint and restart code into a single file (per subsystem)
  - Take uts_sem around access to uts->{release,version,machine}
  - Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
  - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
  - Revert use of 'pr_fmt' to avoid tainting whoever includes us (Nathan Lynch)
  - Explicitly indicate length of UTS fields in header
  - Discard field 'h->parent' from ckpt_hdr
Changelog[v12]:
  - ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
  - Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
  - Befriend sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special ckpt_debug()
Changelog[v10]:
  - add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
  - force end-of-string in ckpt_read_string() (fix possible DoS)
Changelog[v9]:
  - ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
  - Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (although it's not really needed)
Changelog[v5]:
  - Rename header files s/ckpt/checkpoint/
Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E.
Hallyn <serue@us.ibm.com>
---
 Makefile                           |    2 +-
 arch/x86/include/asm/unistd_32.h   |    2 -
 arch/x86/kernel/syscall_table_32.S |    2 -
 checkpoint/Makefile                |    6 +-
 checkpoint/checkpoint.c            |  213 ++++++++++++++++
 checkpoint/process.c               |  102 ++++++++
 checkpoint/restart.c               |  459 +++++++++++++++++++++++++++++++++++
 checkpoint/sys.c                   |  471 +++++++++++++++++++++++++++++++++++-
 include/linux/Kbuild               |    3 +
 include/linux/checkpoint.h         |  200 +++++++++++++++
 include/linux/checkpoint_hdr.h     |  130 ++++++++++
 include/linux/checkpoint_types.h   |   42 ++++
 include/linux/magic.h              |    3 +
 include/linux/syscalls.h           |    4 -
 lib/Kconfig.debug                  |   13 +
 15 files changed, 1634 insertions(+), 18 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/process.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h
 create mode 100644 include/linux/checkpoint_types.h

diff --git a/Makefile b/Makefile
index 1452de1..ecced57 100644
--- a/Makefile
+++ b/Makefile
@@ -650,7 +650,7 @@ export mod_strip_cmd
 ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 	$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 55b7cae..a66ed15 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -344,8 +344,6 @@
 #define __NR_perf_event_open 336
 #define __NR_recvmmsg 337
 #define __NR_eclone 338
-#define __NR_checkpoint 339
-#define __NR_restart 340
 #ifdef __KERNEL__
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 899a4f1..22ae7ef 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -338,5 +338,3 @@ ENTRY(sys_call_table)
 	.long sys_perf_event_open
 	.long sys_recvmmsg
 	.long
ptregs_eclone
-	.long sys_checkpoint
-	.long sys_restart /* 340 */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..99364cc 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,8 @@
 # Makefile for linux checkpoint/restart.
 #
-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += \
+	sys.o \
+	checkpoint.o \
+	restart.o \
+	process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..2f8b038
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,213 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?)
+ */
+static atomic_t ctx_count = ATOMIC_INIT(0);
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	return ckpt_kwrite(ctx, h, h->len);
+}
+EXPORT_SYMBOL(ckpt_write_obj);
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ *
+ * If @ptr is NULL, then write only the header (payload to follow)
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h));
+	if (!h)
+		return -ENOMEM;
+
+	h->type = type;
+	h->len = len + sizeof(*h);
+
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	ret = ckpt_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		goto out;
+	if (ptr)
+		ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+	_ckpt_hdr_put(ctx, h, sizeof(*h));
+	return ret;
+}
+EXPORT_SYMBOL(ckpt_write_obj_type);
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+EXPORT_SYMBOL(ckpt_write_buffer);
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
+{
+	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+EXPORT_SYMBOL(ckpt_write_string);
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+static void fill_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	h->task_comm_len = sizeof(tsk->comm);
+	/* uts */
+ h->uts_release_len = sizeof(uts->release); + h->uts_version_len = sizeof(uts->version); + h->uts_machine_len = sizeof(uts->machine); +} + +/* write the checkpoint header */ +static int checkpoint_write_header(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_header *h; + struct new_utsname *uts; + struct timeval ktv; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER); + if (!h) + return -ENOMEM; + + do_gettimeofday(&ktv); + uts = utsname(); + + h->magic = CHECKPOINT_MAGIC_HEAD; + h->major = (LINUX_VERSION_CODE >> 16) & 0xff; + h->minor = (LINUX_VERSION_CODE >> 8) & 0xff; + h->patch = (LINUX_VERSION_CODE) & 0xff; + + h->rev = CHECKPOINT_VERSION; + + h->uflags = ctx->uflags; + h->time = ktv.tv_sec; + + fill_kernel_const(&h->constants); + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret < 0) + return ret; + + down_read(&uts_sem); + ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release)); + if (ret < 0) + goto up; + ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version)); + if (ret < 0) + goto up; + ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine)); + up: + up_read(&uts_sem); + return ret; +} + +/* write the container configuration section */ +static int checkpoint_container(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_container *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CONTAINER); + if (!h) + return -ENOMEM; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + return ret; +} + +/* write the checkpoint trailer */ +static int checkpoint_write_tail(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_tail *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL); + if (!h) + return -ENOMEM; + + h->magic = CHECKPOINT_MAGIC_TAIL; + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + return ret; +} + +long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) +{ + long ret; + + ret = checkpoint_write_header(ctx); + if (ret < 0) + goto out; + ret = 
checkpoint_container(ctx); + if (ret < 0) + goto out; + ret = checkpoint_task(ctx, current); + if (ret < 0) + goto out; + ret = checkpoint_write_tail(ctx); + if (ret < 0) + goto out; + + /* on success, return (unique) checkpoint identifier */ + ctx->crid = atomic_inc_return(&ctx_count); + ret = ctx->crid; + out: + return ret; +} diff --git a/checkpoint/process.c b/checkpoint/process.c new file mode 100644 index 0000000..d221c2a --- /dev/null +++ b/checkpoint/process.c @@ -0,0 +1,102 @@ +/* + * Checkpoint task structure + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + +#include <linux/sched.h> +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + +/*********************************************************************** + * Checkpoint + */ + +/* dump the task_struct of a given task */ +static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_task *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK); + if (!h) + return -ENOMEM; + + h->state = t->state; + h->exit_state = t->exit_state; + h->exit_code = t->exit_code; + h->exit_signal = t->exit_signal; + + h->set_child_tid = (unsigned long) t->set_child_tid; + h->clear_child_tid = (unsigned long) t->clear_child_tid; + + /* FIXME: save remaining relevant task_struct fields */ + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret < 0) + return ret; + + return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN); +} + +/* dump the entire state of a given task */ +int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) +{ + int ret; + + ctx->tsk = t; + + ret = checkpoint_task_struct(ctx, t); + ckpt_debug("task %d\n", ret); + + ctx->tsk = NULL; + return ret; +} + 
+/*********************************************************************** + * Restart + */ + +/* read the task_struct into the current task */ +static int restore_task_struct(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_task *h; + struct task_struct *t = current; + int ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK); + if (IS_ERR(h)) + return PTR_ERR(h); + + memset(t->comm, 0, TASK_COMM_LEN); + ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN); + if (ret < 0) + goto out; + + t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid; + t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid; + + /* FIXME: restore remaining relevant task_struct fields */ + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +/* read the entire state of the current task */ +int restore_task(struct ckpt_ctx *ctx) +{ + int ret; + + ret = restore_task_struct(ctx); + ckpt_debug("task %d\n", ret); + + return ret; +} diff --git a/checkpoint/restart.c b/checkpoint/restart.c new file mode 100644 index 0000000..29e051c --- /dev/null +++ b/checkpoint/restart.c @@ -0,0 +1,459 @@ +/* + * Restart logic and helpers + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + +#include <linux/version.h> +#include <linux/sched.h> +#include <linux/file.h> +#include <linux/magic.h> +#include <linux/utsname.h> +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + +static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h) +{ + char *ptr; + int len, ret; + + len = h->len - sizeof(*h); + ptr = kzalloc(len + 1, GFP_KERNEL); + if (!ptr) { + ckpt_debug("insufficient memory to report image error\n"); + return -ENOMEM; + } + + ret = ckpt_kread(ctx, ptr, len); + if (ret >= 0) { + ckpt_debug("%s\n", &ptr[1]); + ret = -EIO; + } + + kfree(ptr); + return ret; +} + +/** + * _ckpt_read_obj - read an object (ckpt_hdr followed by payload) + * @ctx: checkpoint context + * @h: desired ckpt_hdr + * @ptr: desired buffer + * @len: desired object length (if 0, flexible) + * @max: maximum object length (if 0, flexible) + * + * If @ptr is NULL, then read only the header (payload to follow) + */ +static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h, + void *ptr, int len, int max) +{ + int ret; + + again: + ret = ckpt_kread(ctx, h, sizeof(*h)); + if (ret < 0) + return ret; + _ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n", + h->type, h->len, len, max); + if (h->len < sizeof(*h)) + return -EINVAL; + + if (h->type == CKPT_HDR_ERROR) { + ret = _ckpt_read_err(ctx, h); + if (ret < 0) + return ret; + goto again; + } + + /* if len specified, enforce, else if maximum specified, enforce */ + if ((len && h->len != len) || (!len && max && h->len > max)) + return -EINVAL; + + if (ptr) + ret = ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr)); + return ret; +} + +/** + * _ckpt_read_obj_type - read an object of some type + * @ctx: checkpoint context + * @ptr: provided buffer + * @len: buffer length + * @type: buffer type + * + * If @ptr is NULL, then read only the header (payload to follow). + * @len specifies the expected buffer length (ignored if set to 0). 
+ * Returns: actual _payload_ length
+ */
+int _ckpt_read_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	if (len)
+		len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+	if (ret < 0)
+		return ret;
+	if (h.type != type)
+		return -EINVAL;
+	return h.len - sizeof(h);
+}
+EXPORT_SYMBOL(_ckpt_read_obj_type);
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: _payload_ length.
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	BUG_ON(!len);
+	return _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+EXPORT_SYMBOL(_ckpt_read_buffer);
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length (including '\0')
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	int ret;
+
+	BUG_ON(!len);
+	ret = _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_STRING);
+	if (ret < 0)
+		return ret;
+	if (ptr)
+		((char *) ptr)[len - 1] = '\0';	/* always play it safe */
+	return 0;
+}
+EXPORT_SYMBOL(_ckpt_read_string);
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ * @len: desired total length (if 0, flexible)
+ * @max: maximum total length
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+	struct ckpt_hdr hh;
+	struct ckpt_hdr *h;
+	int ret;
+
+	ret = ckpt_kread(ctx, &hh, sizeof(hh));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    hh.type, hh.len, len, max);
+	if (hh.len < sizeof(*h))
+		return ERR_PTR(-EINVAL);
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && hh.len != len) || (!len && max && hh.len > max))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_hdr_get(ctx, hh.len);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	*h = hh;	/* yay ! */
+
+	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, len, len);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+EXPORT_SYMBOL(ckpt_read_obj_type);
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flexible)
+ * @ctx: checkpoint context
+ * @max: maximum payload length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @max;
+ * unlimited if @max is 0), as determined by the ckpt_hdr data.
+ *
+ * NOTE: for symmetry with checkpoint, @max is the maximum _payload_
+ * size, excluding the header.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int max, int type)
+{
+	struct ckpt_hdr *h;
+
+	if (max)
+		max += sizeof(struct ckpt_hdr);
+
+	h = ckpt_read_obj(ctx, 0, max);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+EXPORT_SYMBOL(ckpt_read_buf_type);
+
+/**
+ * ckpt_read_payload - allocate and read the payload of an object
+ * @ctx: checkpoint context
+ * @max: maximum payload length
+ * @ptr: pointer to buffer to be allocated (caller must free)
+ * @type: desired object type
+ *
+ * This can be used to read a variable-length _payload_ from the checkpoint
+ * stream. @max limits the size of the resulting buffer.
+ *
+ * Return: actual _payload_ length
+ */
+int ckpt_read_payload(struct ckpt_ctx *ctx, void **ptr, int max, int type)
+{
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, type);
+	if (len < 0)
+		return len;
+	else if (len > max)
+		return -EINVAL;
+
+	*ptr = kmalloc(len, GFP_KERNEL);
+	if (!*ptr)
+		return -ENOMEM;
+
+	ret = ckpt_kread(ctx, *ptr, len);
+	if (ret < 0) {
+		kfree(*ptr);
+		return ret;
+	}
+
+	return len;
+}
+EXPORT_SYMBOL(ckpt_read_payload);
+
+/**
+ * ckpt_read_string - allocate and read a string (variable length)
+ * @ctx: checkpoint context
+ * @max: maximum acceptable length
+ *
+ * Return: allocated string or error pointer
+ */
+char *ckpt_read_string(struct ckpt_ctx *ctx, int max)
+{
+	char *str;
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **)&str, max, CKPT_HDR_STRING);
+	if (len < 0)
+		return ERR_PTR(len);
+	str[len - 1] = '\0';	/* always play it safe */
+	return str;
+}
+EXPORT_SYMBOL(ckpt_read_string);
+
+/**
+ * ckpt_read_consume - consume the next object of expected type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * This can be used to skip an object in the input stream when the
+ * data is unnecessary for the restart. @len indicates the length of
+ * the object; if @len is zero the length is unconstrained.
+ */
+int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret = 0;
+
+	h = ckpt_read_obj(ctx, len, 0);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->type != type)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(ckpt_read_consume);
+
+/***********************************************************************
+ * Restart
+ */
+
+static int check_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	if (h->task_comm_len != sizeof(tsk->comm))
+		return -EINVAL;
+	/* uts */
+	if (h->uts_release_len != sizeof(uts->release))
+		return -EINVAL;
+	if (h->uts_version_len != sizeof(uts->version))
+		return -EINVAL;
+	if (h->uts_machine_len != sizeof(uts->machine))
+		return -EINVAL;
+
+	return 0;
+}
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
+	    h->rev != CHECKPOINT_VERSION ||
+	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    h->patch != ((LINUX_VERSION_CODE) & 0xff)) {
+		ckpt_err(ctx, ret, "incompatible kernel version");
+		goto out;
+	}
+	if (h->uflags) {
+		ckpt_err(ctx, ret, "incompatible restart user flags");
+		goto out;
+	}
+
+	ret = check_kernel_const(&h->constants);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "incompatible kernel constants");
+		goto out;
+	}
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ctx->oflags = h->uflags;
+
+	/* FIX: verify compatibility of release, version and machine */
+	ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret <
0) + goto out; + ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version)); + if (ret < 0) + goto out; + ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine)); + out: + kfree(uts); + ckpt_hdr_put(ctx, h); + return ret; +} + +/* read the container configuration section */ +static int restore_container(struct ckpt_ctx *ctx) +{ + int ret = 0; + struct ckpt_hdr_container *h; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CONTAINER); + if (IS_ERR(h)) + return PTR_ERR(h); + ckpt_hdr_put(ctx, h); + + return ret; +} + +/* read the checkpoint trailer */ +static int restore_read_tail(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_tail *h; + int ret = 0; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL); + if (IS_ERR(h)) + return PTR_ERR(h); + + if (h->magic != CHECKPOINT_MAGIC_TAIL) + ret = -EINVAL; + + ckpt_hdr_put(ctx, h); + return ret; +} + +long do_restart(struct ckpt_ctx *ctx, pid_t pid) +{ + long ret; + + ret = restore_read_header(ctx); + if (ret < 0) + return ret; + ret = restore_container(ctx); + if (ret < 0) + return ret; + ret = restore_task(ctx); + if (ret < 0) + return ret; + ret = restore_read_tail(ctx); + + /* on success, adjust the return value if needed [TODO] */ + return ret; +} diff --git a/checkpoint/sys.c b/checkpoint/sys.c index a81750a..f642485 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -8,12 +8,408 @@ * distribution for more details. */ +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + #include <linux/sched.h> #include <linux/kernel.h> #include <linux/syscalls.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/uaccess.h> +#include <linux/capability.h> +#include <linux/checkpoint.h> + +/* + * Helpers to write(read) from(to) kernel space to(from) the checkpoint + * image file descriptor (similar to how a core-dump is performed). 
+ *
+ * ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ * ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;	/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/**
+ * ckpt_hdr_get - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: desired length
+ *
+ * Returns pointer to header
+ */
+void *ckpt_hdr_get(struct ckpt_ctx *ctx, int len)
+{
+	return kzalloc(len, GFP_KERNEL);
+}
+EXPORT_SYMBOL(ckpt_hdr_get);
+
+/**
+ * _ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ * @len: header length
+ *
+ * (requiring 'ptr' makes it easily interchangeable with kmalloc/kfree)
+ */
+void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	kfree(ptr);
+}
+EXPORT_SYMBOL(_ckpt_hdr_put);
+
+/**
+ * ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ *
+ * It is assumed that @ptr begins with a 'struct ckpt_hdr'.
+ */
+void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr *h = (struct ckpt_hdr *) ptr;
+	_ckpt_hdr_put(ctx, ptr, h->len);
+}
+EXPORT_SYMBOL(ckpt_hdr_put);
+
+/**
+ * ckpt_hdr_get_type - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: number of bytes to reserve
+ *
+ * Returns pointer to reserved space on hbuf
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_hdr_get(ctx, len);
+	if (!h)
+		return NULL;
+
+	h->type = type;
+	h->len = len;
+	return h;
+}
+EXPORT_SYMBOL(ckpt_hdr_get_type);
+
+/*
+ * Helpers to manage c/r contexts: allocated for each checkpoint and/or
+ * restart operation, and persist until the operation is completed.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	if (ctx->logfile)
+		fput(ctx->logfile);
+	kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
+				       unsigned long kflags, int logfd)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->uflags = uflags;
+	ctx->kflags = kflags;
+
+	mutex_init(&ctx->msg_mutex);
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+	if (logfd == CHECKPOINT_FD_NONE)
+		goto nolog;
+	ctx->logfile = fget(logfd);
+	if (!ctx->logfile)
+		goto err;
+ nolog:
+	return ctx;
+ err:
+	ckpt_ctx_free(ctx);
+	return ERR_PTR(err);
+}
+
+static void ckpt_set_error(struct ckpt_ctx *ctx, int err)
+{
+	if (!ckpt_test_and_set_ctx_kflag(ctx, CKPT_CTX_ERROR))
+		ctx->errno = err;
+}
+
+/* helpers to handle log/dbg/err messages */
+void ckpt_msg_lock(struct ckpt_ctx *ctx)
+{
+	if (!ctx)
+		return;
+	mutex_lock(&ctx->msg_mutex);
+	ctx->msg[0] = '\0';
+	ctx->msglen = 1;
+}
+
+void ckpt_msg_unlock(struct ckpt_ctx *ctx)
+{
+	if (!ctx)
+		return;
+	mutex_unlock(&ctx->msg_mutex);
+}
+
+static inline int is_special_flag(char *s)
+{
+	if (*s == '%' && s[1] == '(' && s[2] != '\0' && s[3] == ')')
+		return 1;
+	return 0;
+}
+
+/*
+ * _ckpt_generate_fmt - handle the special flags in the enhanced format
+ * strings used by checkpoint/restart error messages.
+ * @ctx: checkpoint context
+ * @fmt: message format
+ *
+ * The special flags are surrounded by %() to help them visually stand
+ * out. For instance, %(O) means an objref. The following special
+ * flags are recognized:
+ *	O: objref
+ *	P: pointer
+ *	T: task
+ *	S: string
+ *	V: variable
+ *
+ * %(O) will be expanded to "[obj %d]". Likewise P, S, and V, will
+ * also expand to format flags requiring an argument to the subsequent
+ * sprintf or printk. T will be expanded to a string with no flags,
+ * requiring no further arguments.
+ *
+ * These do not accept any extra flags (i.e. min field width, precision,
+ * etc).
+ *
+ * The caller of ckpt_err() and _ckpt_err() must provide
+ * the additional variables, in order, to match the @fmt (except for
+ * the T key), e.g.:
+ *
+ *	ckpt_err(ctx, err, "%(T)FILE flags %d %(O)\n", flags, objref);
+ *
+ * May be called under spinlock.
+ * Must be called with ctx->msg_mutex held. The expanded format
+ * will be placed in ctx->fmt.
+ */
+static void _ckpt_generate_fmt(struct ckpt_ctx *ctx, char *fmt)
+{
+	char *s = ctx->fmt;
+	int len = 0;
+
+	for (; *fmt && len < CKPT_MSG_LEN; fmt++) {
+		if (!is_special_flag(fmt)) {
+			s[len++] = *fmt;
+			continue;
+		}
+		switch (fmt[2]) {
+		case 'O':
+			len += snprintf(s+len, CKPT_MSG_LEN-len, "[obj %%d]");
+			break;
+		case 'P':
+			len += snprintf(s+len, CKPT_MSG_LEN-len, "[ptr %%p]");
+			break;
+		case 'V':
+			len += snprintf(s+len, CKPT_MSG_LEN-len, "[sym %%pS]");
+			break;
+		case 'S':
+			len += snprintf(s+len, CKPT_MSG_LEN-len, "[str %%s]");
+			break;
+		case 'T':
+			if (ctx->tsk)
+				len += snprintf(s+len, CKPT_MSG_LEN-len,
+					"[pid %d tsk %s]",
+					task_pid_vnr(ctx->tsk), ctx->tsk->comm);
+			else
+				len += snprintf(s+len, CKPT_MSG_LEN-len,
+					"[pid -1 tsk NULL]");
+			break;
+		default:
+			printk(KERN_ERR "c/r: bad format specifier %c\n",
+			       fmt[2]);
+			BUG();
+		}
+		fmt += 3;
+	}
+	if (len == CKPT_MSG_LEN)
+		s[CKPT_MSG_LEN-1] = '\0';
+	else
+		s[len] = '\0';
+}
+
+static void _ckpt_msg_appendv(struct ckpt_ctx *ctx, int err, char *fmt,
+			      va_list ap)
+{
+	int len = ctx->msglen;
+
+	if (err) {
+		len += snprintf(&ctx->msg[len], CKPT_MSG_LEN-len, "[err %d]",
+				err);
+		if (len > CKPT_MSG_LEN)
+			goto full;
+	}
+
+	len += snprintf(&ctx->msg[len], CKPT_MSG_LEN-len, "[pos %lld]",
+			ctx->total);
+	len += vsnprintf(&ctx->msg[len], CKPT_MSG_LEN-len, fmt, ap);
+	if (len > CKPT_MSG_LEN) {
+full:
+		len = CKPT_MSG_LEN;
+		ctx->msg[CKPT_MSG_LEN-1] = '\0';
+	}
+	ctx->msglen = len;
+}
+
+void _ckpt_msg_append(struct ckpt_ctx *ctx, char *fmt, ...)
+{ + va_list ap; + + va_start(ap, fmt); + _ckpt_msg_appendv(ctx, 0, fmt, ap); + va_end(ap); +} + +void _ckpt_msg_complete(struct ckpt_ctx *ctx) +{ + int ret; + + /* Don't write an empty or uninitialized msg */ + if (ctx->msglen <= 1) + return; + + if (ctx->kflags & CKPT_CTX_CHECKPOINT && ctx->errno) { + ret = ckpt_write_obj_type(ctx, NULL, 0, CKPT_HDR_ERROR); + if (!ret) + ret = ckpt_write_string(ctx, ctx->msg, ctx->msglen); + if (ret < 0) + printk(KERN_NOTICE "c/r: error string unsaved (%d): %s\n", + ret, ctx->msg+1); + } + + if (ctx->logfile) { + mm_segment_t fs = get_fs(); + set_fs(KERNEL_DS); + ret = _ckpt_kwrite(ctx->logfile, ctx->msg+1, ctx->msglen-1); + set_fs(fs); + } + +#ifdef CONFIG_CHECKPOINT_DEBUG + printk(KERN_DEBUG "%s", ctx->msg+1); +#endif + + ctx->msglen = 0; +} + +#define __do_ckpt_msg(ctx, err, fmt) do { \ + va_list ap; \ + _ckpt_generate_fmt(ctx, fmt); \ + va_start(ap, fmt); \ + _ckpt_msg_appendv(ctx, err, ctx->fmt, ap); \ + va_end(ap); \ +} while (0) + +void _do_ckpt_msg(struct ckpt_ctx *ctx, int err, char *fmt, ...) +{ + __do_ckpt_msg(ctx, err, fmt); +} + +void do_ckpt_msg(struct ckpt_ctx *ctx, int err, char *fmt, ...) 
+{ + if (!ctx) + return; + + ckpt_msg_lock(ctx); + __do_ckpt_msg(ctx, err, fmt); + _ckpt_msg_complete(ctx); + ckpt_msg_unlock(ctx); + + if (err) + ckpt_set_error(ctx, err); +} +EXPORT_SYMBOL(do_ckpt_msg); + +/* checkpoint/restart syscalls */ /** - * sys_checkpoint - checkpoint a container + * do_sys_checkpoint - checkpoint a container * @pid: pid of the container init(1) process * @fd: file to which dump the checkpoint image * @flags: checkpoint operation flags @@ -22,14 +418,32 @@ * Returns positive identifier on success, 0 when returning from restart * or negative value on error */ -SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, - unsigned long, flags, int, logfd) +long do_sys_checkpoint(pid_t pid, int fd, unsigned long flags, int logfd) { - return -ENOSYS; + struct ckpt_ctx *ctx; + long ret; + + /* no flags for now */ + if (flags) + return -EINVAL; + + if (pid == 0) + pid = task_pid_vnr(current); + ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT, logfd); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = do_checkpoint(ctx, pid); + + if (!ret) + ret = ctx->crid; + + ckpt_ctx_free(ctx); + return ret; } /** - * sys_restart - restart a container + * do_sys_restart - restart a container * @pid: pid of task root (in coordinator's namespace), or 0 * @fd: file from which read the checkpoint image * @flags: restart operation flags @@ -38,8 +452,49 @@ SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, * Returns negative value on error, or otherwise returns in the realm * of the original checkpoint */ -SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, - unsigned long, flags, int, logfd) +long do_sys_restart(pid_t pid, int fd, unsigned long flags, int logfd) +{ + struct ckpt_ctx *ctx = NULL; + long ret; + + /* no flags for now */ + if (flags) + return -EINVAL; + + ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART, logfd); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = do_restart(ctx, pid); + + /* restart(2) isn't idempotent: can't restart syscall */ + if (ret == 
-ERESTARTSYS || ret == -ERESTARTNOINTR || + ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK) + ret = -EINTR; + + ckpt_ctx_free(ctx); + return ret; +} + + +/* 'ckpt_debug_level' controls the verbosity level of c/r code */ +#ifdef CONFIG_CHECKPOINT_DEBUG + +/* FIX: allow to change during runtime */ +unsigned long __read_mostly ckpt_debug_level = CKPT_DDEFAULT; +EXPORT_SYMBOL(ckpt_debug_level); + +static __init int ckpt_debug_setup(char *s) { - return -ENOSYS; + long val, ret; + + ret = strict_strtoul(s, 10, &val); + if (ret < 0) + return ret; + ckpt_debug_level = val; + return 0; } + +__setup("ckpt_debug=", ckpt_debug_setup); + +#endif /* CONFIG_CHECKPOINT_DEBUG */ diff --git a/include/linux/Kbuild b/include/linux/Kbuild index 756f831..bcf487c 100644 --- a/include/linux/Kbuild +++ b/include/linux/Kbuild @@ -44,6 +44,9 @@ header-y += bpqether.h header-y += bsg.h header-y += can.h header-y += cdk.h +header-y += checkpoint.h +header-y += checkpoint_hdr.h +header-y += checkpoint_types.h header-y += chio.h header-y += coda_psdev.h header-y += coff.h diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h new file mode 100644 index 0000000..8591f79 --- /dev/null +++ b/include/linux/checkpoint.h @@ -0,0 +1,200 @@ +#ifndef _LINUX_CHECKPOINT_H_ +#define _LINUX_CHECKPOINT_H_ +/* + * Generic checkpoint-restart + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */
+
+#define CHECKPOINT_VERSION	3
+
+/* misc user visible */
+#define CHECKPOINT_FD_NONE	-1
+
+#ifdef __KERNEL__
+#ifdef CONFIG_CHECKPOINT
+
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/err.h>
+
+/* syscall helpers */
+extern long do_sys_checkpoint(pid_t pid, int fd,
+			      unsigned long flags, int logfd);
+extern long do_sys_restart(pid_t pid, int fd,
+			   unsigned long flags, int logfd);
+
+/* ckpt_ctx: kflags */
+#define CKPT_CTX_CHECKPOINT_BIT		0
+#define CKPT_CTX_RESTART_BIT		1
+
+#define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
+#define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
+
+extern void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int n);
+extern void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr);
+extern void *ckpt_hdr_get(struct ckpt_ctx *ctx, int n);
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int max, int type);
+extern int ckpt_read_payload(struct ckpt_ctx *ctx,
+			     void **ptr, int max, int type);
+extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
+extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
+
+extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+
+/* task */
+extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task(struct ckpt_ctx *ctx);
+
+static inline int ckpt_validate_errno(int errno)
+{
+	return (errno >= 0) && (errno < MAX_ERRNO);
+}
+
+/* debugging flags */
+#define CKPT_DBASE	0x1		/* anything */
+#define CKPT_DSYS	0x2		/* generic (system) */
+#define CKPT_DRW	0x4		/* image read/write */
+
+#define CKPT_DDEFAULT	0xffff		/* default debug level */
+
+#ifndef CKPT_DFLAG
+#define CKPT_DFLAG	0xffff		/* everything */
+#endif
+
+#ifdef CONFIG_CHECKPOINT_DEBUG
+extern unsigned long ckpt_debug_level;
+
+/*
+ * This is deprecated
+ */
+/* use this to select a specific debug level */
+#define _ckpt_debug(level, fmt, args...) \
+	do { \
+		if (ckpt_debug_level & (level)) \
+			printk(KERN_DEBUG "[%d:%d:c/r:%s:%d] " fmt, \
+				current->pid, \
+				current->nsproxy ? \
+				task_pid_vnr(current) : -1, \
+				__func__, __LINE__, ## args); \
+	} while (0)
+
+/*
+ * CKPT_DBASE is the base flags, doesn't change
+ * CKPT_DFLAG is to be redefined in each source file
+ */
+#define ckpt_debug(fmt, args...) \
+	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
+
+#else
+
+/*
+ * This is deprecated
+ */
+#define _ckpt_debug(level, fmt, args...)	do { } while (0)
+#define ckpt_debug(fmt, args...)		do { } while (0)
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
+
+/*
+ * prototypes for the new logging api
+ */
+
+extern void ckpt_msg_lock(struct ckpt_ctx *ctx);
+extern void ckpt_msg_unlock(struct ckpt_ctx *ctx);
+
+extern void _do_ckpt_msg(struct ckpt_ctx *ctx, int err, char *fmt, ...);
+extern void do_ckpt_msg(struct ckpt_ctx *ctx, int err, char *fmt, ...);
+
+/*
+ * Append formatted msg to ctx->msg[ctx->msglen].
+ * Must be called after expanding format.
+ * May be called under spinlock.
+ * Must be called under ckpt_msg_lock().
+ */ +extern void _ckpt_msg_append(struct ckpt_ctx *ctx, char *fmt, ...); + +/* + * Write ctx->msg to all relevant places. + * Must not be called under spinlock. + * Must be called under ckpt_msg_lock(). + */ +extern void _ckpt_msg_complete(struct ckpt_ctx *ctx); + +/* + * Append an enhanced formatted message to ctx->msg. + * This will not write the message out to the applicable files, so + * the caller will have to use _ckpt_msg_complete() to finish up. + * @ctx must be a valid checkpoint context. + * @fmt is the extended format + * + * Must be called with ckpt_msg_lock held. + */ +#define _ckpt_msg(ctx, fmt, args...) do { \ + _do_ckpt_msg(ctx, 0, fmt, ##args); \ +} while (0) + +/* + * Append an enhanced formatted message to ctx->msg. + * This will take the ckpt_msg_lock and also write the message out + * to the applicable files by calling _ckpt_msg_complete(). + * @ctx must be a valid checkpoint context. + * @fmt is the extended format + * + * Must not be called under spinlock. + */ +#define ckpt_msg(ctx, fmt, args...) do { \ + do_ckpt_msg(ctx, 0, fmt, ##args); \ +} while (0) + +/* + * Report an error. + * This will take the ckpt_msg_lock and also write the message out + * to the applicable files by calling _ckpt_msg_complete(). + * @ctx must be a valid checkpoint context. + * @err is the error value + * @fmt is the extended format + * + * Must not be called under spinlock. + */ + +#define ckpt_err(ctx, err, fmt, args...) do { \ + do_ckpt_msg(ctx, err, "[E @ %s:%d]" fmt, __func__, __LINE__, ##args); \ +} while (0) + +/* + * Same as ckpt_err() but + * must be called with ctx->msg_mutex held + * can be called under spinlock + * must be followed by a call to _ckpt_msg_complete() + */ +#define _ckpt_err(ctx, err, fmt, args...) 
do { \ + _do_ckpt_msg(ctx, err, "[E @ %s:%d]" fmt, __func__, __LINE__, ##args); \ +} while (0) + +#endif /* CONFIG_CHECKPOINT */ +#endif /* __KERNEL__ */ + +#endif /* _LINUX_CHECKPOINT_H_ */ diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h new file mode 100644 index 0000000..97330ec --- /dev/null +++ b/include/linux/checkpoint_hdr.h @@ -0,0 +1,130 @@ +#ifndef _CHECKPOINT_CKPT_HDR_H_ +#define _CHECKPOINT_CKPT_HDR_H_ +/* + * Generic container checkpoint-restart + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#ifndef __KERNEL__ +#include <sys/types.h> +#include <linux/types.h> +#endif + +#ifdef __KERNEL__ +#include <linux/types.h> +#endif + +#include <linux/utsname.h> + +/* + * To maintain compatibility between 32-bit and 64-bit architecture flavors, + * keep data 64-bit aligned: use padding for structure members, and use + * __attribute__((aligned (8))) for the entire structure. + * + * Quoting Arnd Bergmann: + * "This structure has an odd multiple of 32-bit members, which means + * that if you put it into a larger structure that also contains 64-bit + * members, the larger structure may get different alignment on x86-32 + * and x86-64, which you might want to avoid. I can't tell if this is + * an actual problem here. ... In this case, I'm pretty sure that + * sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it + * will be 32-bit aligned on x86-32." + */ + +/* + * header format: 'struct ckpt_hdr' must prefix all other headers. Therfore + * when a header is passed around, the information about it (type, size) + * is readily available. Structs that include a struct ckpt_hdr are named + * struct ckpt_hdr_* by convention (usualy the struct ckpt_hdr is the first + * member). 
+ */ +struct ckpt_hdr { + __u32 type; + __u32 len; +} __attribute__((aligned(8))); + +/* header types */ +enum { + CKPT_HDR_HEADER = 1, +#define CKPT_HDR_HEADER CKPT_HDR_HEADER + CKPT_HDR_CONTAINER, +#define CKPT_HDR_CONTAINER CKPT_HDR_CONTAINER + CKPT_HDR_BUFFER, +#define CKPT_HDR_BUFFER CKPT_HDR_BUFFER + CKPT_HDR_STRING, +#define CKPT_HDR_STRING CKPT_HDR_STRING + + CKPT_HDR_TASK = 101, +#define CKPT_HDR_TASK CKPT_HDR_TASK + + CKPT_HDR_TAIL = 9001, +#define CKPT_HDR_TAIL CKPT_HDR_TAIL + + CKPT_HDR_ERROR = 9999, +#define CKPT_HDR_ERROR CKPT_HDR_ERROR +}; + +/* kernel constants */ +struct ckpt_const { + /* task */ + __u16 task_comm_len; + /* uts */ + __u16 uts_release_len; + __u16 uts_version_len; + __u16 uts_machine_len; +} __attribute__((aligned(8))); + +/* checkpoint image header */ +struct ckpt_hdr_header { + struct ckpt_hdr h; + __u64 magic; + + __u16 _padding; + + __u16 major; + __u16 minor; + __u16 patch; + __u16 rev; + + struct ckpt_const constants; + + __u64 time; /* when checkpoint taken */ + __u64 uflags; /* uflags from checkpoint */ + + /* + * the header is followed by three strings: + * char release[const.uts_release_len]; + * char version[const.uts_version_len]; + * char machine[const.uts_machine_len]; + */ +} __attribute__((aligned(8))); + +/* checkpoint image trailer */ +struct ckpt_hdr_tail { + struct ckpt_hdr h; + __u64 magic; +} __attribute__((aligned(8))); + +/* container configuration section header */ +struct ckpt_hdr_container { + struct ckpt_hdr h; +} __attribute__((aligned(8)));; + +/* task data */ +struct ckpt_hdr_task { + struct ckpt_hdr h; + __u32 state; + __u32 exit_state; + __u32 exit_code; + __u32 exit_signal; + + __u64 set_child_tid; + __u64 clear_child_tid; +} __attribute__((aligned(8))); + +#endif /* _CHECKPOINT_CKPT_HDR_H_ */ diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h new file mode 100644 index 0000000..6327ad0 --- /dev/null +++ b/include/linux/checkpoint_types.h @@ -0,0 +1,42 @@ +#ifndef 
_LINUX_CHECKPOINT_TYPES_H_ +#define _LINUX_CHECKPOINT_TYPES_H_ +/* + * Generic checkpoint-restart + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#ifdef __KERNEL__ + +#include <linux/fs.h> + +struct ckpt_ctx { + int crid; /* unique checkpoint id */ + + pid_t root_pid; /* container identifier */ + + unsigned long kflags; /* kerenl flags */ + unsigned long uflags; /* user flags */ + unsigned long oflags; /* restart: uflags from checkpoint */ + + struct file *file; /* input/output file */ + struct file *logfile; /* status/debug log file */ + loff_t total; /* total read/written */ + + struct task_struct *tsk;/* checkpoint: current target task */ + char err_string[256]; /* checkpoint: error string */ + +#define CKPT_MSG_LEN 1024 + char fmt[CKPT_MSG_LEN]; + char msg[CKPT_MSG_LEN]; + int msglen; + struct mutex msg_mutex; +}; + +#endif /* __KERNEL__ */ + +#endif /* _LINUX_CHECKPOINT_TYPES_H_ */ diff --git a/include/linux/magic.h b/include/linux/magic.h index 76285e0..fb54f14 100644 --- a/include/linux/magic.h +++ b/include/linux/magic.h @@ -59,4 +59,7 @@ #define DEVPTS_SUPER_MAGIC 0x1cd1 #define SOCKFS_MAGIC 0x534F434B +#define CHECKPOINT_MAGIC_HEAD 0x00feed0cc0a2d200LL +#define CHECKPOINT_MAGIC_TAIL 0x002d2a0cc0deef00LL + #endif /* __LINUX_MAGIC_H__ */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 3d80ac0..207466a 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -825,10 +825,6 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *, asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int, struct timespec __user *, const sigset_t __user *, size_t); -asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags, - int logfd); -asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags, - int logfd); int 
kernel_execve(const char *filename, char *const argv[], char *const envp[]); diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 25c3ed5..943ac71 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1054,6 +1054,19 @@ config DMA_API_DEBUG This option causes a performance degredation. Use only if you want to debug device drivers. If unsure, say N. +config CHECKPOINT_DEBUG + bool "Checkpoint/restart debugging (EXPERIMENTAL)" + depends on CHECKPOINT + default y + help + This option turns on the debugging output of checkpoint/restart. + The level of verbosity is controlled by 'ckpt_debug_level' and can + be set at boot time with the "ckpt_debug=" option. + + Turning this option off will reduce the size of the c/r code. If + turned on, it is unlikely to incur visible overhead if the debug + level is set to zero. + source "samples/Kconfig" source "lib/Kconfig.kgdb" -- 1.6.3.3
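A note on the image layout defined in checkpoint_hdr.h above: every object in the image starts with a struct ckpt_hdr carrying its type and length, so a reader can step through the image record by record. The following is only a minimal userspace sketch of that idea, not code from the patch set. ckpt_count_records() is a hypothetical helper, and it assumes that h->len is the total record size including the header, which is how the ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, ...) calls in the x86 patch appear to size their objects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of struct ckpt_hdr from the patch (__u32 -> uint32_t). */
struct ckpt_hdr {
	uint32_t type;
	uint32_t len;
} __attribute__((aligned(8)));

/* A few of the header types from the patch, for illustration. */
enum {
	CKPT_HDR_HEADER = 1,
	CKPT_HDR_TASK = 101,
	CKPT_HDR_TAIL = 9001,
};

/*
 * Hypothetical helper (not part of the patch): walk a buffer of
 * back-to-back records and count them.
 * Assumption: h.len is the whole record size, header included.
 * Returns the number of records, or -1 if a record is malformed
 * or truncated.
 */
static int ckpt_count_records(const unsigned char *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		struct ckpt_hdr h;

		/* not enough bytes left for a header */
		if (size - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));

		/* a record must hold its own header and fit in the buffer */
		if (h.len < sizeof(h) || h.len > size - off)
			return -1;

		off += h.len;
		n++;
	}
	return n;
}
```

Copying the header out with memcpy() avoids relying on the alignment of the input buffer; the kernel side does not need this because it allocates its own header buffers.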
* [C/R v20][PATCH 24/96] c/r: x86_32 support for checkpoint/restart 2010-03-17 16:08 ` [C/R v20][PATCH 23/96] c/r: basic infrastructure for checkpoint/restart Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 25/96] c/r: x86-64: checkpoint/restart implementation Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan Add logic to save and restore architecture-specific state, including thread-specific state, CPU registers and FPU state. In addition, architecture capabilities are saved in an architecture-specific extension of the header (ckpt_hdr_header_arch); currently this includes only FPU capabilities. Currently only x86-32 is supported. Changelog[v19]: - [Serge Hallyn] Use ckpt_err() for arch incompatibilities Changelog[v19-rc3]: - Rebase to kernel 2.6.33: * Use PTREGSCALL4 for sys_{checkpoint,restart} * Remove debug-reg support (need to redo with perf_events) - [Serge Hallyn] Support for ia32 (checkpoint, restart) - Split arch/x86/checkpoint.c to generic and 32bit specific parts - sys_{checkpoint,restart} to use ptregs Changelog[v19-rc1]: - Fix up headers so we can munge them for use by userspace - [Matt Helsley] Add cpp definitions for enums - Allow X86_EFLAGS_RF on restart Changelog[v17]: - Fix compilation for architectures that don't support checkpoint - Validate cpu registers and TLS descriptors on restart - Validate debug registers on restart - Export asm/checkpoint_hdr.h to userspace Changelog[v16]: - All objects are preceded by ckpt_hdr (TLS and xstate_buf) - Add architecture identifier to main header Changelog[v14]: - Use new interface ckpt_hdr_get/put() - Embed struct ckpt_hdr in struct ckpt_hdr... 
- Remove preempt_disable/enable() around init_fpu() and fix leak - Revert change to pr_debug(), back to ckpt_debug() - Move code related to task_struct to checkpoint/process.c Changelog[v12]: - A couple of missed calls to ckpt_hbuf_put() - Replace obsolete ckpt_debug() with pr_debug() Changelog[v9]: - Add arch-specific header that details architecture capabilities; split FPU restore to send capabilities only once. - Test for zero TLS entries in ckpt_write_thread() - Fix asm/checkpoint_hdr.h so it can be included from user-space Changelog[v7]: - Fix save/restore state of FPU Changelog[v5]: - Remove preempt_disable() when restoring debug registers Changelog[v4]: - Fix header structure alignment Changelog[v2]: - Pad header structures to 64 bits to ensure compatibility - Follow Dave Hansen's refactoring of the original post Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- arch/x86/ia32/ia32entry.S | 9 + arch/x86/include/asm/Kbuild | 1 + arch/x86/include/asm/checkpoint_hdr.h | 112 +++++++++ arch/x86/include/asm/syscalls.h | 6 + arch/x86/include/asm/unistd_32.h | 2 + arch/x86/kernel/Makefile | 8 + arch/x86/kernel/checkpoint.c | 420 +++++++++++++++++++++++++++++++++ arch/x86/kernel/checkpoint_32.c | 173 ++++++++++++++ arch/x86/kernel/entry_32.S | 8 + arch/x86/kernel/syscall_table_32.S | 2 + checkpoint/checkpoint.c | 7 +- checkpoint/process.c | 20 ++- checkpoint/restart.c | 8 + include/linux/checkpoint.h | 9 + include/linux/checkpoint_hdr.h | 20 ++- 15 files changed, 801 insertions(+), 4 deletions(-) create mode 100644 arch/x86/include/asm/checkpoint_hdr.h create mode 100644 arch/x86/kernel/checkpoint.c create mode 100644 arch/x86/kernel/checkpoint_32.c diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 5eec1d9..738a930 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -478,6 +478,13 @@ quiet_ni_syscall: PTREGSCALL stub32_vfork, 
sys_vfork, %rdi PTREGSCALL stub32_iopl, sys_iopl, %rsi PTREGSCALL stub32_eclone, sys_eclone, %r8 +#ifdef CONFIG_CHECKPOINT + PTREGSCALL stub32_checkpoint, sys_checkpoint, %r8 + PTREGSCALL stub32_restart, sys_restart, %r8 +#else + PTREGSCALL stub32_checkpoint, sys_ni_syscall, %r8 + PTREGSCALL stub32_restart, sys_ni_syscall, %r8 +#endif ENTRY(ia32_ptregs_common) popq %r11 @@ -844,4 +851,6 @@ ia32_sys_call_table: .quad sys_perf_event_open .quad compat_sys_recvmmsg .quad stub32_eclone + .quad stub32_checkpoint + .quad stub32_restart /* 340 */ ia32_syscall_end: diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild index 9f828f8..3b90273 100644 --- a/arch/x86/include/asm/Kbuild +++ b/arch/x86/include/asm/Kbuild @@ -2,6 +2,7 @@ include include/asm-generic/Kbuild.asm header-y += boot.h header-y += bootparam.h +header-y += checkpoint_hdr.h header-y += debugreg.h header-y += ldt.h header-y += msr-index.h diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h new file mode 100644 index 0000000..e6cfc99 --- /dev/null +++ b/arch/x86/include/asm/checkpoint_hdr.h @@ -0,0 +1,112 @@ +#ifndef __ASM_X86_CKPT_HDR_H +#define __ASM_X86_CKPT_HDR_H +/* + * Checkpoint/restart - architecture specific headers x86 + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#ifndef _CHECKPOINT_CKPT_HDR_H_ +#error asm/checkpoint_hdr.h included directly +#endif + +#include <linux/types.h> + +/* + * To maintain compatibility between 32-bit and 64-bit architecture flavors, + * keep data 64-bit aligned: use padding for structure members, and use + * __attribute__((aligned (8))) for the entire structure. 
+ * + * Quoting Arnd Bergmann: + * "This structure has an odd multiple of 32-bit members, which means + * that if you put it into a larger structure that also contains 64-bit + * members, the larger structure may get different alignment on x86-32 + * and x86-64, which you might want to avoid. I can't tell if this is + * an actual problem here. ... In this case, I'm pretty sure that + * sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it + * will be 32-bit aligned on x86-32." + */ + +/* i387 structure seen from kernel/userspace */ +#ifdef __KERNEL__ +#include <asm/processor.h> +#endif + +#ifdef CONFIG_X86_32 +#define CKPT_ARCH_ID CKPT_ARCH_X86_32 +#endif + +/* arch dependent header types */ +enum { + CKPT_HDR_CPU_FPU = 201, +#define CKPT_HDR_CPU_FPU CKPT_HDR_CPU_FPU +}; + +struct ckpt_hdr_header_arch { + struct ckpt_hdr h; + /* FIXME: add HAVE_HWFP */ + __u16 has_fxsr; + __u16 has_xsave; + __u16 xstate_size; + __u16 _pading; +} __attribute__((aligned(8))); + +struct ckpt_hdr_thread { + struct ckpt_hdr h; + __u32 thread_info_flags; + __u16 gdt_entry_tls_entries; + __u16 sizeof_tls_array; +} __attribute__((aligned(8))); + +/* designed to work for both x86_32 and x86_64 */ +struct ckpt_hdr_cpu { + struct ckpt_hdr h; + /* see struct pt_regs (x86_64) */ + __u64 r15; + __u64 r14; + __u64 r13; + __u64 r12; + __u64 bp; + __u64 bx; + __u64 r11; + __u64 r10; + __u64 r9; + __u64 r8; + __u64 ax; + __u64 cx; + __u64 dx; + __u64 si; + __u64 di; + __u64 orig_ax; + __u64 ip; + __u64 sp; + + __u64 flags; + + /* segment registers */ + __u64 fs; + __u64 gs; + + __u16 fsindex; + __u16 gsindex; + __u16 cs; + __u16 ss; + __u16 ds; + __u16 es; + + __u32 used_math; + + /* thread_xstate contents follow (if used_math) */ +} __attribute__((aligned(8))); + +#define CKPT_X86_SEG_NULL 0 +#define CKPT_X86_SEG_USER32_CS 1 +#define CKPT_X86_SEG_USER32_DS 2 +#define CKPT_X86_SEG_TLS 0x4000 /* 0100 0000 0000 00xx */ +#define CKPT_X86_SEG_LDT 0x8000 /* 100x xxxx xxxx xxxx */ + +#endif 
/* __ASM_X86_CKPT_HDR__H */ diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 972ab0e..c71262e 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -29,6 +29,12 @@ long sys_clone(unsigned long, unsigned long, void __user *, void __user *, struct pt_regs *); long sys_eclone(unsigned flags_low, struct clone_args __user *uca, int args_size, pid_t __user *pids, struct pt_regs *regs); +#ifdef CONFIG_CHECKPOINT +long sys_checkpoint(pid_t pid, int fd, unsigned long flags, + int logfd, struct pt_regs *regs); +long sys_restart(pid_t pid, int fd, unsigned long flags, + int logfd, struct pt_regs *regs); +#endif /* kernel/ldt.c */ asmlinkage int sys_modify_ldt(int, void __user *, unsigned long); diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index a66ed15..55b7cae 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -344,6 +344,8 @@ #define __NR_perf_event_open 336 #define __NR_recvmmsg 337 #define __NR_eclone 338 +#define __NR_checkpoint 339 +#define __NR_restart 340 #ifdef __KERNEL__ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index d87f09b..2f45350 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -116,6 +116,14 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o +obj-$(CONFIG_CHECKPOINT) += checkpoint.o + +### +# 32 bit specific files +ifeq ($(CONFIG_X86_32),y) + obj-$(CONFIG_CHECKPOINT) += checkpoint_32.o +endif + ### # 64 bit specific files ifeq ($(CONFIG_X86_64),y) diff --git a/arch/x86/kernel/checkpoint.c b/arch/x86/kernel/checkpoint.c new file mode 100644 index 0000000..06fe740 --- /dev/null +++ b/arch/x86/kernel/checkpoint.c @@ -0,0 +1,420 @@ +/* + * Checkpoint/restart - architecture specific support for x86 + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. 
See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + +#include <asm/desc.h> +#include <asm/i387.h> + +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + + +/* + * sys_checkpoint needs to be a ptregscall to match sys_restart + * so self-checkpoint images can be restarted. + */ +long sys_checkpoint(pid_t pid, int fd, unsigned long flags, int logfd, + struct pt_regs *regs) +{ + return do_sys_checkpoint(pid, fd, flags, logfd); +} + +/* + * sys_restart needs to access and modify the pt_regs structure to + * restore the original state from the time of the checkpoint. + */ +long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd, + struct pt_regs *regs) +{ + return do_sys_restart(pid, fd, flags, logfd); +} + + +extern int check_segment(__u16 seg); +extern __u16 encode_segment(unsigned short seg); +extern unsigned short decode_segment(__u16 seg); +extern void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t); +extern int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t); + +static int check_tls(struct desc_struct *desc) +{ + if (!desc->a && !desc->b) + return 1; + if (desc->l != 0 || desc->s != 1 || desc->dpl != 3) + return 0; + return 1; +} + +#define CKPT_X86_TIF_UNSUPPORTED (_TIF_SECCOMP | _TIF_IO_BITMAP) + +/************************************************************************** + * Checkpoint + */ + +static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t) +{ +#ifdef CONFIG_X86_32 + if (t->thread.vm86_info) { + ckpt_err(ctx, -EBUSY, "%(T)Task in VM86 mode\n"); + return -EBUSY; + } +#endif + + /* debugregs not (yet) supported */ + if (test_tsk_thread_flag(t, TIF_DEBUG)) { + ckpt_err(ctx, -EBUSY, "%(T)Task with debugreg set\n"); + return -EBUSY; + } + + if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) { + ckpt_err(ctx, -EBUSY, "%(T)Bad thread info flags %#lx\n", + 
task_thread_info(t)->flags); + return -EBUSY; + } + return 0; +} + +/* dump the thread_struct of a given task */ +int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_thread *h; + int tls_size; + int ret; + + ret = may_checkpoint_thread(ctx, t); + if (ret < 0) + return ret; + + tls_size = sizeof(t->thread.tls_array); + + h = ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD); + if (!h) + return -ENOMEM; + + h->thread_info_flags = + task_thread_info(t)->flags & ~CKPT_X86_TIF_UNSUPPORTED; + h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES; + h->sizeof_tls_array = tls_size; + + /* For simplicity dump the entire array */ + memcpy(h + 1, t->thread.tls_array, tls_size); + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + return ret; +} + +static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + /* + * FIXME: as of kernel 2.6.33 debug registers are handled via + * perf_event interface. For now, neither is supported. + */ +} + +static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + h->used_math = tsk_used_math(t) ? 1 : 0; +} + +static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr *h; + int ret; + + h = ckpt_hdr_get_type(ctx, xstate_size + sizeof(*h), + CKPT_HDR_CPU_FPU); + if (!h) + return -ENOMEM; + + /* i387 + MMX + SSE logic */ + preempt_disable(); /* needed if (t == current) */ + + /* + * normally, no need to unlazy_fpu(), since TS_USEDFPU flag + * was cleared when task was context-switched out... + * except if we are in process context, in which case we do + */ + if (t == current && (task_thread_info(t)->status & TS_USEDFPU)) + unlazy_fpu(current); + + /* + * For simplicity dump the entire structure. + * FIX: need to be deliberate about what registers we are + * dumping for traceability and compatibility. 
+ */ + memcpy(h + 1, t->thread.xstate, xstate_size); + preempt_enable(); /* needed if (t == current) */ + + ret = ckpt_write_obj(ctx, h); + ckpt_hdr_put(ctx, h); + + return ret; +} + +/* dump the cpu state and registers of a given task */ +int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_cpu *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU); + if (!h) + return -ENOMEM; + + save_cpu_regs(h, t); + save_cpu_debug(h, t); + save_cpu_fpu(h, t); + + ckpt_debug("math %d\n", h->used_math); + + ret = ckpt_write_obj(ctx, &h->h); + if (ret < 0) + goto out; + + if (h->used_math) + ret = checkpoint_cpu_fpu(ctx, t); + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +int checkpoint_write_header_arch(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_header_arch *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH); + if (!h) + return -ENOMEM; + + /* FPU capabilities */ + h->has_fxsr = cpu_has_fxsr; + h->has_xsave = cpu_has_xsave; + h->xstate_size = xstate_size; + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + return ret; +} + +/************************************************************************** + * Restart + */ + +/* read the thread_struct into the current task */ +int restore_thread(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_thread *h; + struct thread_struct *thread = &current->thread; + struct desc_struct *desc; + int tls_size; + int i, cpu, ret; + + tls_size = sizeof(thread->tls_array); + + h = ckpt_read_obj_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD); + if (IS_ERR(h)) + return PTR_ERR(h); + + ret = -EINVAL; + if (h->thread_info_flags & CKPT_X86_TIF_UNSUPPORTED) + goto out; + if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES) + goto out; + if (h->sizeof_tls_array != tls_size) + goto out; + + /* + * restore TLS by hand: why convert to struct user_desc if + * sys_set_thread_area() will convert it back? 
+ */ + desc = (struct desc_struct *) (h + 1); + + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) { + if (!check_tls(&desc[i])) + goto out; + } + + cpu = get_cpu(); + memcpy(thread->tls_array, desc, tls_size); + load_TLS(thread, cpu); + put_cpu(); + + /* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */ + + ret = 0; + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + /* + * FIXME: as of kernel 2.6.33 debug registers are handled via + * perf_event interface. For now, neither is supported. + */ + + return 0; +} + +static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + preempt_disable(); + + __clear_fpu(t); /* in case we used FPU in user mode */ + + if (!h->used_math) + clear_used_math(); + + preempt_enable(); + return 0; +} + +static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr *h; + int ret; + + /* init_fpu() eventually also calls set_used_math() */ + ret = init_fpu(current); + if (ret < 0) + return ret; + + h = ckpt_read_obj_type(ctx, xstate_size + sizeof(*h), + CKPT_HDR_CPU_FPU); + if (IS_ERR(h)) + return PTR_ERR(h); + + memcpy(t->thread.xstate, h + 1, xstate_size); + + ckpt_hdr_put(ctx, h); + return ret; +} + +static int check_eflags(__u32 eflags) +{ +#define X86_EFLAGS_CKPT_MASK \ + (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \ + X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \ + X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID | X86_EFLAGS_RF) + + if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2)) + return 0; + return 1; +} + +static void restore_eflags(struct pt_regs *regs, __u32 eflags) +{ + /* + * A task may have had X86_EFLAGS_RF set at checkpoint, e.g.: + * 1) It ran in a KVM guest, and the guest was being debugged, + * 2) The kernel was debugged using kgdb, + * 3) From Intel's manual: "When calling an event handler, + * Intel 64 and IA-32 processors establish the 
value of the + * RF flag in the EFLAGS image pushed on the stack: + * - For any fault-class exception except a debug exception + * generated in response to an instruction breakpoint, the + * value pushed for RF is 1. + * - For any interrupt arriving after any iteration of a + * repeated string instruction but the last iteration, the + * value pushed for RF is 1. + * - For any trap-class exception generated by any iteration + * of a repeated string instruction but the last iteration, + * the value pushed for RF is 1. + * - For other cases, the value pushed for RF is the value + * that was in EFLAG.RF at the time the event handler was + * called." + * [from: http://www.intel.com/Assets/PDF/manual/253668.pdf] + * + * The RF flag may be set in EFLAGS by the hardware, or by + * kvm/kgdb, or even by the user with ptrace or by setting a + * suitable context when returning from a signal handler. + * + * Therefore, on restart we (1) preserve X86_EFLAGS_RF from + * checkpoint time, and (2) preserve an X86_EFLAGS_RF of the + * restarting process if it already exists on saved EFLAGS. + * Disable preemption to protect EFLAGS test-and-change. 
+ */ + preempt_disable(); + eflags |= (regs->flags & X86_EFLAGS_RF); + regs->flags = eflags; + preempt_enable(); +} + +static int load_cpu_eflags(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct pt_regs *regs = task_pt_regs(t); + + if (!check_eflags(h->flags)) + return -EINVAL; + restore_eflags(regs, h->flags); + return 0; +} + +/* read the cpu state and registers for the current task */ +int restore_cpu(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_cpu *h; + struct task_struct *t = current; + int ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU); + if (IS_ERR(h)) + return PTR_ERR(h); + + ckpt_debug("math %d\n", h->used_math); + + ret = load_cpu_regs(h, t); + if (ret < 0) + goto out; + ret = load_cpu_eflags(h, t); + if (ret < 0) + goto out; + ret = load_cpu_debug(h, t); + if (ret < 0) + goto out; + ret = load_cpu_fpu(h, t); + if (ret < 0) + goto out; + + if (h->used_math) + ret = restore_cpu_fpu(ctx, t); + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +int restore_read_header_arch(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_header_arch *h; + int ret = 0; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH); + if (IS_ERR(h)) + return PTR_ERR(h); + + /* FIX: verify compatibility of architecture features */ + + /* verify FPU capabilities */ + if (h->has_fxsr != cpu_has_fxsr || + h->has_xsave != cpu_has_xsave || + h->xstate_size != xstate_size) { + ret = -EINVAL; + ckpt_err(ctx, ret, "incompatible FPU capabilities"); + } + + ckpt_hdr_put(ctx, h); + return ret; +} diff --git a/arch/x86/kernel/checkpoint_32.c b/arch/x86/kernel/checkpoint_32.c new file mode 100644 index 0000000..32cde34 --- /dev/null +++ b/arch/x86/kernel/checkpoint_32.c @@ -0,0 +1,173 @@ +/* + * Checkpoint/restart - architecture specific support for x86_32 + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. 
See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + +#include <asm/desc.h> +#include <asm/i387.h> +#include <asm/elf.h> + +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + +/* helpers to encode/decode/validate segments */ + +static int check_segment(__u16 seg) +{ + int ret = 0; + + switch (seg) { + case CKPT_X86_SEG_NULL: + case CKPT_X86_SEG_USER32_CS: + case CKPT_X86_SEG_USER32_DS: + return 1; + } + if (seg & CKPT_X86_SEG_TLS) { + seg &= ~CKPT_X86_SEG_TLS; + if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN) + ret = 1; + } else if (seg & CKPT_X86_SEG_LDT) { + seg &= ~CKPT_X86_SEG_LDT; + if (seg <= 0x1fff) + ret = 1; + } + return ret; +} + +static __u16 encode_segment(unsigned short seg) +{ + if (seg == 0) + return CKPT_X86_SEG_NULL; + BUG_ON((seg & 3) != 3); + + if (seg == __USER_CS) + return CKPT_X86_SEG_USER32_CS; + if (seg == __USER_DS) + return CKPT_X86_SEG_USER32_DS; + + if (seg & 4) + return CKPT_X86_SEG_LDT | (seg >> 3); + + seg >>= 3; + if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX) + return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN); + + printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg); + BUG(); +} + +static unsigned short decode_segment(__u16 seg) +{ + if (seg == CKPT_X86_SEG_NULL) + return 0; + if (seg == CKPT_X86_SEG_USER32_CS) + return __USER_CS; + if (seg == CKPT_X86_SEG_USER32_DS) + return __USER_DS; + + if (seg & CKPT_X86_SEG_TLS) { + seg &= ~CKPT_X86_SEG_TLS; + return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3; + } + if (seg & CKPT_X86_SEG_LDT) { + seg &= ~CKPT_X86_SEG_LDT; + return (seg << 3) | 7; + } + BUG(); +} + +void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct thread_struct *thread = &t->thread; + struct pt_regs *regs = task_pt_regs(t); + unsigned long _gs; + + h->bp = regs->bp; + h->bx = regs->bx; + h->ax = regs->ax; + h->cx = regs->cx; + h->dx = regs->dx; + h->si = 
regs->si; + h->di = regs->di; + h->orig_ax = regs->orig_ax; + h->ip = regs->ip; + + h->flags = regs->flags; + h->sp = regs->sp; + + h->cs = encode_segment(regs->cs); + h->ss = encode_segment(regs->ss); + h->ds = encode_segment(regs->ds); + h->es = encode_segment(regs->es); + + /* + * for checkpoint in process context (from within a container) + * the GS segment register should be saved from the hardware; + * otherwise it is already saved on the thread structure + */ + if (t == current) + _gs = get_user_gs(regs); + else + _gs = thread->gs; + + h->fsindex = encode_segment(regs->fs); + h->gsindex = encode_segment(_gs); + + /* + * for checkpoint in process context (from within a container), + * the actual syscall is taking place at this very moment; so + * we (optimistically) subtitute the future return value (0) of + * this syscall into the orig_eax, so that upon restart it will + * succeed (or it will endlessly retry checkpoint...) + */ + if (t == current) { + BUG_ON(h->orig_ax < 0); + h->ax = 0; + } +} + +int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct thread_struct *thread = &t->thread; + struct pt_regs *regs = task_pt_regs(t); + + if (h->cs == CKPT_X86_SEG_NULL) + return -EINVAL; + if (!check_segment(h->cs) || !check_segment(h->ds) || + !check_segment(h->es) || !check_segment(h->ss) || + !check_segment(h->fsindex) || !check_segment(h->gsindex)) + return -EINVAL; + + regs->bp = h->bp; + regs->bx = h->bx; + regs->ax = h->ax; + regs->cx = h->cx; + regs->dx = h->dx; + regs->si = h->si; + regs->di = h->di; + regs->orig_ax = h->orig_ax; + regs->ip = h->ip; + + regs->sp = h->sp; + + regs->ds = decode_segment(h->ds); + regs->es = decode_segment(h->es); + regs->cs = decode_segment(h->cs); + regs->ss = decode_segment(h->ss); + + regs->fs = decode_segment(h->fsindex); + regs->gs = decode_segment(h->gsindex); + + thread->gs = regs->gs; + lazy_load_gs(regs->gs); + + return 0; +} diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S 
index 65e1735..49d6628 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -781,6 +781,14 @@ PTREGSCALL0(rt_sigreturn) PTREGSCALL2(vm86) PTREGSCALL1(vm86old) PTREGSCALL4(eclone) +#ifdef CONFIG_CHECKPOINT +PTREGSCALL4(checkpoint) +PTREGSCALL4(restart) +#else +/* Use the weak defs in kernel/sys_ni.c */ +#define ptregs_checkpoint sys_checkpoint +#define ptregs_restart sys_restart +#endif /* Clone is an oddball. The 4th arg is in %edi */ ALIGN; diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index 22ae7ef..dc81ec9 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -338,3 +338,5 @@ ENTRY(sys_call_table) .long sys_perf_event_open .long sys_recvmmsg .long ptregs_eclone + .long ptregs_checkpoint + .long ptregs_restart /* 340 */ diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 2f8b038..c74b21e 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -126,6 +126,8 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx) do_gettimeofday(&ktv); uts = utsname(); + h->arch_id = cpu_to_le16(CKPT_ARCH_ID); /* see asm/checkpoitn.h */ + h->magic = CHECKPOINT_MAGIC_HEAD; h->major = (LINUX_VERSION_CODE >> 16) & 0xff; h->minor = (LINUX_VERSION_CODE >> 8) & 0xff; @@ -153,7 +155,10 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx) ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine)); up: up_read(&uts_sem); - return ret; + if (ret < 0) + return ret; + + return checkpoint_write_header_arch(ctx); } /* write the container configuration section */ diff --git a/checkpoint/process.c b/checkpoint/process.c index d221c2a..f6fb9d1 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -56,7 +56,15 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) ret = checkpoint_task_struct(ctx, t); ckpt_debug("task %d\n", ret); - + if (ret < 0) + goto out; + ret = checkpoint_thread(ctx, t); + ckpt_debug("thread %d\n", ret); + if 
(ret < 0) + goto out; + ret = checkpoint_cpu(ctx, t); + ckpt_debug("cpu %d\n", ret); + out: ctx->tsk = NULL; return ret; } @@ -97,6 +105,14 @@ int restore_task(struct ckpt_ctx *ctx) ret = restore_task_struct(ctx); ckpt_debug("task %d\n", ret); - + if (ret < 0) + goto out; + ret = restore_thread(ctx); + ckpt_debug("thread %d\n", ret); + if (ret < 0) + goto out; + ret = restore_cpu(ctx); + ckpt_debug("cpu %d\n", ret); + out: return ret; } diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 29e051c..38a9b04 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -368,6 +368,10 @@ static int restore_read_header(struct ckpt_ctx *ctx) return PTR_ERR(h); ret = -EINVAL; + if (le16_to_cpu(h->arch_id) != CKPT_ARCH_ID) { + ckpt_err(ctx, ret, "incompatible architecture id"); + goto out; + } if (h->magic != CHECKPOINT_MAGIC_HEAD || h->rev != CHECKPOINT_VERSION || h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) || @@ -402,6 +406,10 @@ static int restore_read_header(struct ckpt_ctx *ctx) if (ret < 0) goto out; ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine)); + if (ret < 0) + goto out; + + ret = restore_read_header_arch(ctx); out: kfree(uts); ckpt_hdr_put(ctx, h); diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 8591f79..3095431 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -68,6 +68,15 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid); extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t); extern int restore_task(struct ckpt_ctx *ctx); +/* arch hooks */ +extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx); +extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t); +extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t); + +extern int restore_read_header_arch(struct ckpt_ctx *ctx); +extern int restore_thread(struct ckpt_ctx *ctx); +extern int restore_cpu(struct ckpt_ctx *ctx); + static inline int ckpt_validate_errno(int 
errno) { return (errno >= 0) && (errno < MAX_ERRNO); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 97330ec..2ab878a 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -48,10 +48,16 @@ struct ckpt_hdr { __u32 len; } __attribute__((aligned(8))); + +#include <asm/checkpoint_hdr.h> + + /* header types */ enum { CKPT_HDR_HEADER = 1, #define CKPT_HDR_HEADER CKPT_HDR_HEADER + CKPT_HDR_HEADER_ARCH, +#define CKPT_HDR_HEADER_ARCH CKPT_HDR_HEADER_ARCH CKPT_HDR_CONTAINER, #define CKPT_HDR_CONTAINER CKPT_HDR_CONTAINER CKPT_HDR_BUFFER, @@ -61,6 +67,12 @@ enum { CKPT_HDR_TASK = 101, #define CKPT_HDR_TASK CKPT_HDR_TASK + CKPT_HDR_THREAD, +#define CKPT_HDR_THREAD CKPT_HDR_THREAD + CKPT_HDR_CPU, +#define CKPT_HDR_CPU CKPT_HDR_CPU + + /* 201-299: reserved for arch-dependent */ CKPT_HDR_TAIL = 9001, #define CKPT_HDR_TAIL CKPT_HDR_TAIL @@ -69,6 +81,12 @@ enum { #define CKPT_HDR_ERROR CKPT_HDR_ERROR }; +/* architecture */ +enum { + CKPT_ARCH_X86_32 = 1, +#define CKPT_ARCH_X86_32 CKPT_ARCH_X86_32 +}; + /* kernel constants */ struct ckpt_const { /* task */ @@ -84,7 +102,7 @@ struct ckpt_hdr_header { struct ckpt_hdr h; __u64 magic; - __u16 _padding; + __u16 arch_id; __u16 major; __u16 minor; -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
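As an aside for readers following the segment helpers in this patch (check_segment(), encode_segment(), decode_segment()), the round trip can be sketched in user space roughly as below. The CKPT_X86_SEG_* values are the ones defined by the patch. The SKETCH_* selector and GDT-slot values are assumptions picked for illustration only, and the sketch_* function names are hypothetical. Where the kernel code would BUG(), the sketch returns -1 instead.

```c
#include <stdint.h>

/* Image-format encodings, as defined by the patch in asm/checkpoint_hdr.h. */
#define CKPT_X86_SEG_NULL	0
#define CKPT_X86_SEG_USER32_CS	1
#define CKPT_X86_SEG_USER32_DS	2
#define CKPT_X86_SEG_TLS	0x4000
#define CKPT_X86_SEG_LDT	0x8000

/* ASSUMED values, for illustration only (not taken from this series). */
#define SKETCH_USER_CS	0x73
#define SKETCH_USER_DS	0x7b
#define SKETCH_TLS_MIN	6	/* first TLS slot in the GDT */
#define SKETCH_TLS_MAX	8	/* last TLS slot in the GDT */

/* Selector -> image encoding; returns -1 where the kernel would BUG(). */
static int32_t sketch_encode_segment(uint16_t seg)
{
	uint16_t idx;

	if (seg == 0)
		return CKPT_X86_SEG_NULL;
	if ((seg & 3) != 3)		/* only user (RPL 3) selectors */
		return -1;
	if (seg == SKETCH_USER_CS)
		return CKPT_X86_SEG_USER32_CS;
	if (seg == SKETCH_USER_DS)
		return CKPT_X86_SEG_USER32_DS;
	if (seg & 4)			/* table-indicator bit: LDT entry */
		return CKPT_X86_SEG_LDT | (seg >> 3);

	idx = seg >> 3;			/* GDT index */
	if (idx >= SKETCH_TLS_MIN && idx <= SKETCH_TLS_MAX)
		return CKPT_X86_SEG_TLS | (idx - SKETCH_TLS_MIN);
	return -1;
}

/*
 * Image encoding -> selector. Like the kernel helper, this expects the
 * encoding to have been validated first; returns -1 if no case matches.
 */
static int32_t sketch_decode_segment(uint16_t enc)
{
	if (enc == CKPT_X86_SEG_NULL)
		return 0;
	if (enc == CKPT_X86_SEG_USER32_CS)
		return SKETCH_USER_CS;
	if (enc == CKPT_X86_SEG_USER32_DS)
		return SKETCH_USER_DS;
	if (enc & CKPT_X86_SEG_TLS)	/* GDT TLS slot, RPL 3 */
		return ((SKETCH_TLS_MIN + (enc & ~CKPT_X86_SEG_TLS)) << 3) | 3;
	if (enc & CKPT_X86_SEG_LDT)	/* LDT entry, TI bit set, RPL 3 */
		return ((enc & ~CKPT_X86_SEG_LDT) << 3) | 7;
	return -1;
}
```

The point of the encoding is that the image stores a TLS slot as an offset from the first TLS GDT entry rather than as a raw selector, so the image does not depend on where a given kernel places its TLS entries in the GDT.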
* [C/R v20][PATCH 25/96] c/r: x86-64: checkpoint/restart implementation
2010-03-17 16:08 ` [C/R v20][PATCH 24/96] c/r: x86_32 support " Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 26/96] c/r: external checkpoint of a task other than ourself Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar,
containers, Oren Laadan
Support for checkpoint and restart for the x86-64 architecture.
Partly based on Alexey's work. Support for 32bit on 64bit and fixes
from Serge Hallyn.

Checkpoint      Restart
(app/arch)      (app/arch/program*)
---------------------------------------
64/x86-64   ->  64/x86-64   works
32/x86-64   ->  32/x86-64   works
32/x86-64   ->  32/x86-32   ?
32/x86-32   ->  32/x86-64   ?
32/x86-64   ->  32/x86-32   ?
32/x86-32   ->  32/x86-64   ?

(*) "program" indicates the bit-ness of 'restart' executable.

Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
- [Serge Hallyn] Changes to fs/gs register handling
- [Serge Hallyn] Allow 32-bit restart of 64-bit and vice versa
- [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Signed-off-by: Serge Hallyn <serue@us.ibm.com> --- arch/x86/Kconfig | 2 +- arch/x86/include/asm/checkpoint_hdr.h | 6 + arch/x86/include/asm/unistd_64.h | 4 + arch/x86/kernel/Makefile | 2 + arch/x86/kernel/checkpoint.c | 16 +++ arch/x86/kernel/checkpoint_64.c | 241 +++++++++++++++++++++++++++++++++ arch/x86/kernel/entry_64.S | 7 + include/linux/checkpoint_hdr.h | 2 + 8 files changed, 279 insertions(+), 1 deletions(-) create mode 100644 arch/x86/kernel/checkpoint_64.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d5a7284..a6ae38a 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -93,7 +93,7 @@ config HAVE_LATENCYTOP_SUPPORT config CHECKPOINT_SUPPORT bool - default y if X86_32 + default y config MMU def_bool y diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h index e6cfc99..6f600dd 100644 --- a/arch/x86/include/asm/checkpoint_hdr.h +++ b/arch/x86/include/asm/checkpoint_hdr.h @@ -36,6 +36,10 @@ #include <asm/processor.h> #endif +#ifdef CONFIG_X86_64 +#define CKPT_ARCH_ID CKPT_ARCH_X86_64 +#endif + #ifdef CONFIG_X86_32 #define CKPT_ARCH_ID CKPT_ARCH_X86_32 #endif @@ -106,6 +110,8 @@ struct ckpt_hdr_cpu { #define CKPT_X86_SEG_NULL 0 #define CKPT_X86_SEG_USER32_CS 1 #define CKPT_X86_SEG_USER32_DS 2 +#define CKPT_X86_SEG_USER64_CS 3 +#define CKPT_X86_SEG_USER64_DS 4 #define CKPT_X86_SEG_TLS 0x4000 /* 0100 0000 0000 00xx */ #define CKPT_X86_SEG_LDT 0x8000 /* 100x xxxx xxxx xxxx */ diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h index d87318d..17bacfd 100644 --- a/arch/x86/include/asm/unistd_64.h +++ b/arch/x86/include/asm/unistd_64.h @@ -665,6 +665,10 @@ __SYSCALL(__NR_perf_event_open, sys_perf_event_open) __SYSCALL(__NR_recvmmsg, sys_recvmmsg) #define __NR_eclone 300 __SYSCALL(__NR_eclone, stub_eclone) +#define __NR_checkpoint 301 +__SYSCALL(__NR_checkpoint, stub_checkpoint) +#define __NR_restart 302 +__SYSCALL(__NR_restart, 
stub_restart) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 2f45350..2d0ff56 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -137,4 +137,6 @@ ifeq ($(CONFIG_X86_64),y) obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o obj-y += vsmp_64.o + + obj-$(CONFIG_CHECKPOINT) += checkpoint_64.o endif diff --git a/arch/x86/kernel/checkpoint.c b/arch/x86/kernel/checkpoint.c index 06fe740..53b7e66 100644 --- a/arch/x86/kernel/checkpoint.c +++ b/arch/x86/kernel/checkpoint.c @@ -251,6 +251,22 @@ int restore_thread(struct ckpt_ctx *ctx) load_TLS(thread, cpu); put_cpu(); + { + int pre, post; + /* + * Eventually we'd like to support mixed-bit restart, but for + * now don't pretend to. + */ + pre = test_thread_flag(TIF_IA32); + post = !!(h->thread_info_flags & _TIF_IA32); + if (pre != post) { + ret = -EINVAL; + ckpt_err(ctx, ret, "%d-bit restarting %d-bit\n", + 64 >> pre, 64 >> post); + goto out; + } + } + /* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */ ret = 0; diff --git a/arch/x86/kernel/checkpoint_64.c b/arch/x86/kernel/checkpoint_64.c new file mode 100644 index 0000000..f8226e2 --- /dev/null +++ b/arch/x86/kernel/checkpoint_64.c @@ -0,0 +1,241 @@ +/* + * Checkpoint/restart - architecture specific support for x86_64 + * + * Copyright (C) 2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + +#include <asm/desc.h> +#include <asm/i387.h> +#include <asm/elf.h> + +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + +/* helpers to encode/decode/validate segments */ + +int check_segment(__u16 seg) +{ + int ret = 0; + + switch (seg) { + case CKPT_X86_SEG_NULL: + case CKPT_X86_SEG_USER64_CS: + case CKPT_X86_SEG_USER64_DS: +#ifdef CONFIG_COMPAT + case CKPT_X86_SEG_USER32_CS: + case CKPT_X86_SEG_USER32_DS: +#endif + return 1; + } + if (seg & CKPT_X86_SEG_TLS) { + seg &= ~CKPT_X86_SEG_TLS; + if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN) + ret = 1; + } else if (seg & CKPT_X86_SEG_LDT) { + seg &= ~CKPT_X86_SEG_LDT; + if (seg <= 0x1fff) + ret = 1; + } + return ret; +} + +__u16 encode_segment(unsigned short seg) +{ + if (seg == 0) + return CKPT_X86_SEG_NULL; + BUG_ON((seg & 3) != 3); + + if (seg == __USER_CS) + return CKPT_X86_SEG_USER64_CS; + if (seg == __USER_DS) + return CKPT_X86_SEG_USER64_DS; +#ifdef CONFIG_COMPAT + if (seg == __USER32_CS) + return CKPT_X86_SEG_USER32_CS; + if (seg == __USER32_DS) + return CKPT_X86_SEG_USER32_DS; +#endif + + if (seg & 4) + return CKPT_X86_SEG_LDT | (seg >> 3); + + seg >>= 3; + if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX) + return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN); + + printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg); + BUG(); +} + +unsigned short decode_segment(__u16 seg) +{ + if (seg == CKPT_X86_SEG_NULL) + return 0; + + if (seg == CKPT_X86_SEG_USER64_CS) + return __USER_CS; + if (seg == CKPT_X86_SEG_USER64_DS) + return __USER_DS; +#ifdef CONFIG_COMPAT + if (seg == CKPT_X86_SEG_USER32_CS) + return __USER32_CS; + if (seg == CKPT_X86_SEG_USER32_DS) + return __USER32_DS; +#endif + + if (seg & CKPT_X86_SEG_TLS) { + seg &= ~CKPT_X86_SEG_TLS; + return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3; + } + if (seg & CKPT_X86_SEG_LDT) { + seg &= ~CKPT_X86_SEG_LDT; + return (seg << 3) | 7; + } + BUG(); +} + +void 
save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct pt_regs *regs = task_pt_regs(t); + unsigned long _ds, _es, _fs, _gs; + + h->r15 = regs->r15; + h->r14 = regs->r14; + h->r13 = regs->r13; + h->r12 = regs->r12; + h->r11 = regs->r11; + h->r10 = regs->r10; + h->r9 = regs->r9; + h->r8 = regs->r8; + + h->bp = regs->bp; + h->bx = regs->bx; + h->ax = regs->ax; + h->cx = regs->cx; + h->dx = regs->dx; + h->si = regs->si; + h->di = regs->di; + h->orig_ax = regs->orig_ax; + h->ip = regs->ip; + + h->flags = regs->flags; + h->sp = regs->sp; + + /* + * for checkpoint in process context (from within a container) + * DS, ES, FS, GS registers should be saved from the hardware; + * otherwise they are already saved on the thread structure + */ + + h->cs = encode_segment(regs->cs); + h->ss = encode_segment(regs->ss); + + if (t == current) { + savesegment(ds, _ds); + savesegment(es, _es); + savesegment(fs, _fs); + savesegment(gs, _gs); + rdmsrl(MSR_FS_BASE, h->fs); + rdmsrl(MSR_KERNEL_GS_BASE, h->gs); + } else { + _ds = t->thread.ds; + _es = t->thread.es; + _fs = t->thread.fsindex; + _gs = t->thread.gsindex; + h->fs = t->thread.fs; + h->gs = t->thread.gs; + } + h->ds = encode_segment(_ds); + h->es = encode_segment(_es); + h->fsindex = encode_segment(_fs); + h->gsindex = encode_segment(_gs); + + /* see comment in __switch_to() */ + if (_fs) + h->fs = 0; + if (_gs) + h->gs = 0; + + /* + * for checkpoint in process context (from within a container), + * the actual syscall is taking place at this very moment; so + * we (optimistically) subtitute the future return value (0) of + * this syscall into the orig_eax, so that upon restart it will + * succeed (or it will endlessly retry checkpoint...) 
+ */ + if (t == current) { + BUG_ON(h->orig_ax < 0); + h->ax = 0; + } +} + +int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct thread_struct *thread = &t->thread; + struct pt_regs *regs = task_pt_regs(t); + + if (h->cs == CKPT_X86_SEG_NULL) + return -EINVAL; + if (!check_segment(h->cs) || !check_segment(h->ds) || + !check_segment(h->es) || !check_segment(h->ss) || + !check_segment(h->fsindex) || !check_segment(h->gsindex)) + return -EINVAL; + + regs->r15 = h->r15; + regs->r14 = h->r14; + regs->r13 = h->r13; + regs->r12 = h->r12; + regs->r11 = h->r11; + regs->r10 = h->r10; + regs->r9 = h->r9; + regs->r8 = h->r8; + + regs->bp = h->bp; + regs->bx = h->bx; + regs->ax = h->ax; + regs->cx = h->cx; + regs->dx = h->dx; + regs->si = h->si; + regs->di = h->di; + regs->orig_ax = h->orig_ax; + regs->ip = h->ip; + + regs->sp = h->sp; + thread->usersp = h->sp; + + preempt_disable(); + + regs->cs = decode_segment(h->cs); + regs->ss = decode_segment(h->ss); + thread->ds = decode_segment(h->ds); + thread->es = decode_segment(h->es); + thread->fsindex = decode_segment(h->fsindex); + thread->gsindex = decode_segment(h->gsindex); + + thread->fs = h->fs; + thread->gs = h->gs; + + /* XXX - unsure is this really needed ... */ + loadsegment(fs, thread->fsindex); + if (thread->fs) + wrmsrl(MSR_FS_BASE, thread->fs); + load_gs_index(thread->gsindex); + /* + * when we switch to user-space, the MSR_KERNEL_GS_BASE + * will be moved back to MSR_GS_BASE. 
+ * http://lists.openwall.net/linux-kernel/2008/11/18/340 + */ + if (thread->gs) + wrmsrl(MSR_KERNEL_GS_BASE, thread->gs); + + preempt_enable(); + + return 0; +} diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 216681e..c2ece28 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -699,6 +699,13 @@ END(\label) PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx PTREGSCALL stub_iopl, sys_iopl, %rsi PTREGSCALL stub_eclone, sys_eclone, %r8 +#ifdef CONFIG_CHECKPOINT + PTREGSCALL stub_checkpoint, sys_checkpoint, %r8 + PTREGSCALL stub_restart, sys_restart, %r8 +#else + PTREGSCALL stub_checkpoint, sys_ni_syscall, %r8 + PTREGSCALL stub_restart, sys_ni_syscall, %r8 +#endif ENTRY(ptregscall_common) DEFAULT_FRAME 1 8 /* offset 8: return address */ diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 2ab878a..4627564 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -85,6 +85,8 @@ enum { enum { CKPT_ARCH_X86_32 = 1, #define CKPT_ARCH_X86_32 CKPT_ARCH_X86_32 + CKPT_ARCH_X86_64, +#define CKPT_ARCH_X86_64 CKPT_ARCH_X86_64 }; /* kernel constants */ -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
* [C/R v20][PATCH 26/96] c/r: external checkpoint of a task other than ourself
2010-03-17 16:08 ` [C/R v20][PATCH 25/96] c/r: x86-64: checkpoint/restart implementation Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 27/96] c/r: export functionality used in next patch for restart-blocks Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar,
containers, Oren Laadan
Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container, unless the CHECKPOINT_SUBTREE flag is given.

Set the state of the freezer cgroup of the checkpointed task hierarchy
to "CHECKPOINTING" during a checkpoint, to ensure that the task(s)
cannot be thawed while at it.

Ensure that all tasks belong to the root task's freezer cgroup (the
root task is also tested, to detect if it changes its freezer cgroup
before it moves to "CHECKPOINTING").

sys_restart() remains nearly the same, as the restart is always done
in the context of the restarting task. However, the original task may
have been frozen from user space, or interrupted from a syscall for
the checkpoint. This is accounted for by restoring a suitable retval
for the restarting task, according to how it was checkpointed.

Changelog[v20]:
- [Nathan Lynch] Use syscall_get_error
Changelog[v19-rc1]:
- [Serge Hallyn] Add global section container to image format
Changelog[v17]:
- Move restore_retval() to this patch
- Tighten ptrace check for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for hierarchy's freezer for checkpoint
Changelog[v16]:
- Use CHECKPOINT_SUBTREE to allow subtree (partial container)
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
- Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Kconfig               |    1 +
 checkpoint/checkpoint.c          |   98 +++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c             |   63 ++++++++++++++++++++++++-
 checkpoint/sys.c                 |   10 ++++
 include/linux/checkpoint_types.h |    7 ++-
 5 files changed, 176 insertions(+), 3 deletions(-)

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index ef7d406..21fc86b 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -5,6 +5,7 @@
 config CHECKPOINT
 	bool "Checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	depends on CGROUP_FREEZER
 	help
 	  Application checkpoint/restart is the ability to save the
 	  state of a running application so that it can later resume
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c74b21e..695ab00 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG CKPT_DSYS

 #include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -193,17 +196,108 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }

+static int
may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	if (t->state == TASK_DEAD) {
+		_ckpt_err(ctx, -EBUSY, "%(T)Task state EXIT_DEAD\n");
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_ATTACH)) {
+		_ckpt_err(ctx, -EPERM, "%(T)Ptrace attach denied\n");
+		return -EPERM;
+	}
+
+	/* verify that all tasks belong to same freezer cgroup */
+	if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) {
+		_ckpt_err(ctx, -EBUSY, "%(T)Not frozen or wrong cgroup\n");
+		return -EBUSY;
+	}
+
+	/* FIX: add support for ptraced tasks */
+	if (task_ptrace(t)) {
+		_ckpt_err(ctx, -EBUSY, "%(T)Task is ptraced\n");
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/* setup checkpoint-specific parts of ctx */
+static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task;
+	struct nsproxy *nsproxy;
+	int ret;
+
+	/*
+	 * No need for explicit cleanup here, because if an error
+	 * occurs then ckpt_ctx_free() is eventually called.
+	 */
+
+	ctx->root_pid = pid;
+
+	/* root task */
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+	if (!task)
+		return -ESRCH;
+	else
+		ctx->root_task = task;
+
+	/* root nsproxy */
+	rcu_read_lock();
+	nsproxy = task_nsproxy(task);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
+	rcu_read_unlock();
+	if (!nsproxy)
+		return -ESRCH;
+	else
+		ctx->root_nsproxy = nsproxy;
+
+	/* root freezer */
+	ctx->root_freezer = task;
+	get_task_struct(task);
+
+	ret = may_checkpoint_task(ctx, task);
+	if (ret) {
+		_ckpt_msg_complete(ctx);
+		put_task_struct(task);
+		put_task_struct(task);
+		put_nsproxy(nsproxy);
+		ctx->root_nsproxy = NULL;
+		ctx->root_task = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;

+	ret = init_checkpoint_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	if (ctx->root_freezer) {
+		ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
+		if (ret < 0)
+			return
ret; + } + ret = checkpoint_write_header(ctx); if (ret < 0) goto out; ret = checkpoint_container(ctx); if (ret < 0) goto out; - ret = checkpoint_task(ctx, current); + ret = checkpoint_task(ctx, ctx->root_task); if (ret < 0) goto out; ret = checkpoint_write_tail(ctx); @@ -214,5 +308,7 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) ctx->crid = atomic_inc_return(&ctx_count); ret = ctx->crid; out: + if (ctx->root_freezer) + cgroup_freezer_end_checkpoint(ctx->root_freezer); return ret; } diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 38a9b04..11d9738 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -447,10 +447,69 @@ static int restore_read_tail(struct ckpt_ctx *ctx) return ret; } +static long restore_retval(void) +{ + struct pt_regs *regs = task_pt_regs(current); + long syscall_err; + long syscall_nr; + + /* + * For the restart, we entered the kernel via sys_restart(), + * so our return path is via the syscall exit. In particular, + * the code in entry.S will put the value that we will return + * into a register (e.g. regs->eax in x86), thus passing it to + * the caller task. + * + * What we do now depends on what happened to the checkpointed + * task right before the checkpoint - there are three cases: + * + * 1) It was carrying out a syscall when became frozen, or + * 2) It was running in userspace, or + * 3) It was doing a self-checkpoint + * + * In case #1, if the syscall succeeded, perhaps partially, + * then the retval is non-negative. If it failed, the error + * may be one of -ERESTART..., which is interpreted in the + * signal handling code. If that is the case, we force the + * signal handler to kick in by faking a signal to ourselves + * (a la freeze/thaw) when ret < 0. + * + * In case #2, our return value will overwrite the original + * value in the affected register. Workaround by simply using + * that saved value of that register as our retval. 
+	 *
+	 * In case #3, the state was recorded while the task was
+	 * in checkpoint(2) syscall. The syscall is expected to return
+	 * 0 when returning from a restart. Fortunately, this already
+	 * has been arranged for at checkpoint time (the register that
+	 * holds the retval, e.g. regs->eax in x86, was set to
+	 * zero).
+	 */
+
+	/* needed for all 3 cases: get old value/error/retval */
+	syscall_nr = syscall_get_nr(current, regs);
+	syscall_err = syscall_get_error(current, regs);
+
+	/* if from a syscall and returning error, kick in signal handling */
+	if (syscall_nr >= 0 && syscall_err != 0)
+		set_tsk_thread_flag(current, TIF_SIGPENDING);
+
+	return syscall_get_return_value(current, regs);
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	return 0;
+}
+
 long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;

+	ret = init_restart_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
 	ret = restore_read_header(ctx);
 	if (ret < 0)
 		return ret;
@@ -461,7 +520,9 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		return ret;
 	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		return ret;

 	/* on success, adjust the return value if needed [TODO] */
-	return ret;
+	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index f642485..308cd27 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -12,7 +12,9 @@
 #define CKPT_DFLAG CKPT_DSYS

 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
+#include <linux/cgroup.h>
 #include <linux/syscalls.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -173,6 +175,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
+
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+	if (ctx->root_freezer)
+		put_task_struct(ctx->root_freezer);
+
 	kfree(ctx);
 }

diff --git a/include/linux/checkpoint_types.h
b/include/linux/checkpoint_types.h index 6327ad0..dc35b21 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -12,12 +12,17 @@ #ifdef __KERNEL__ +#include <linux/sched.h> +#include <linux/nsproxy.h> #include <linux/fs.h> struct ckpt_ctx { int crid; /* unique checkpoint id */ - pid_t root_pid; /* container identifier */ + pid_t root_pid; /* [container] root pid */ + struct task_struct *root_task; /* [container] root task */ + struct nsproxy *root_nsproxy; /* [container] root nsproxy */ + struct task_struct *root_freezer; /* [container] root task */ unsigned long kflags; /* kerenl flags */ unsigned long uflags; /* user flags */ -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
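The three cases described in the restore_retval() comment of this patch can be modelled in user space roughly as follows. This is only a sketch: struct sketch_regs, struct sketch_outcome and the sketch_* functions are hypothetical stand-ins for pt_regs and the syscall_get_*() helpers, and the error test (a value in the range [-4095, -1] counts as an error) is an assumption about how syscall_get_error() behaves on x86, not something stated in the patch.

```c
#include <stdbool.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed bound of the error range */

/* Hypothetical stand-in for the saved registers of the checkpointed task. */
struct sketch_regs {
	long syscall_nr;	/* >= 0 if inside a syscall, -1 if in userspace */
	long ax;		/* saved return-value register */
};

struct sketch_outcome {
	long retval;		/* value the restarting task returns */
	bool fake_signal;	/* true if signal handling is kicked in */
};

/* Stand-in for syscall_get_error(): the saved value if it is an error, else 0. */
static long sketch_get_error(const struct sketch_regs *regs)
{
	return (regs->ax < 0 && regs->ax >= -SKETCH_MAX_ERRNO) ? regs->ax : 0;
}

static struct sketch_outcome sketch_restore_retval(const struct sketch_regs *regs)
{
	struct sketch_outcome out;

	/* In all three cases the saved register value becomes our retval. */
	out.retval = regs->ax;
	/* Case #1 with an error: let the signal code interpret -ERESTART... */
	out.fake_signal = regs->syscall_nr >= 0 && sketch_get_error(regs) != 0;
	return out;
}
```

In this model a task frozen inside a syscall that had returned -512 gets a fake signal so the restart logic can run, a task that was in userspace simply gets its old register value back, and a self-checkpointed task sees 0 because the register was already zeroed at checkpoint time.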
* [C/R v20][PATCH 27/96] c/r: export functionality used in next patch for restart-blocks 2010-03-17 16:08 ` [C/R v20][PATCH 26/96] c/r: external checkpoint of a task other than ourself Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` [C/R v20][PATCH 28/96] c/r: restart-blocks Oren Laadan 0 siblings, 1 reply; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan To support c/r of restart-blocks (system calls that need to be restarted because they were interrupted but there was no userspace visible side-effect), export restart-block callbacks for poll() and futex() syscalls. More details on c/r of restart-blocks, and how it works, are in the following patch. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- fs/select.c | 2 +- include/linux/futex.h | 11 +++++++++++ include/linux/poll.h | 3 +++ include/linux/posix-timers.h | 6 ++++++ kernel/compat.c | 4 ++-- kernel/futex.c | 12 +----------- kernel/posix-timers.c | 2 +- 7 files changed, 25 insertions(+), 15 deletions(-) diff --git a/fs/select.c b/fs/select.c index fd38ce2..7e3de2c 100644 --- a/fs/select.c +++ b/fs/select.c @@ -873,7 +873,7 @@ out_fds: return err; } -static long do_restart_poll(struct restart_block *restart_block) +long do_restart_poll(struct restart_block *restart_block) { struct pollfd __user *ufds = restart_block->poll.ufds; int nfds = restart_block->poll.nfds; diff --git a/include/linux/futex.h b/include/linux/futex.h index 1e5a26d..ae755f6 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -136,6 +136,17 @@ extern int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi); /* + * In case we must use restart_block to restart a futex_wait, + * we encode in the 'flags' shared capability + */ +#define FLAGS_SHARED 0x01 +#define 
FLAGS_CLOCKRT 0x02 +#define FLAGS_HAS_TIMEOUT 0x04 + +/* for c/r */ +extern long futex_wait_restart(struct restart_block *restart); + +/* * Futexes are matched on equal values of this key. * The key type depends on whether it's a shared or private mapping. * Don't rearrange members without looking at hash_futex(). diff --git a/include/linux/poll.h b/include/linux/poll.h index 6673743..6c40d42 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -134,6 +134,9 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec); +/* used by checkpoint/restart */ +extern long do_restart_poll(struct restart_block *restart_block); + #endif /* KERNEL */ #endif /* _LINUX_POLL_H */ diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h index 4f71bf4..d0d6a66 100644 --- a/include/linux/posix-timers.h +++ b/include/linux/posix-timers.h @@ -101,6 +101,10 @@ int posix_cpu_timer_create(struct k_itimer *timer); int posix_cpu_nsleep(const clockid_t which_clock, int flags, struct timespec *rqtp, struct timespec __user *rmtp); long posix_cpu_nsleep_restart(struct restart_block *restart_block); +#ifdef CONFIG_COMPAT +long compat_nanosleep_restart(struct restart_block *restart); +long compat_clock_nanosleep_restart(struct restart_block *restart); +#endif int posix_cpu_timer_set(struct k_itimer *timer, int flags, struct itimerspec *new, struct itimerspec *old); int posix_cpu_timer_del(struct k_itimer *timer); @@ -119,4 +123,6 @@ long clock_nanosleep_restart(struct restart_block *restart_block); void update_rlimit_cpu(unsigned long rlim_new); +int invalid_clockid(const clockid_t which_clock); + #endif diff --git a/kernel/compat.c b/kernel/compat.c index f6c204f..20afdba 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -100,7 +100,7 @@ int put_compat_timespec(const struct timespec *ts, struct compat_timespec __user __put_user(ts->tv_nsec, &cts->tv_nsec)) ? 
-EFAULT : 0; } -static long compat_nanosleep_restart(struct restart_block *restart) +long compat_nanosleep_restart(struct restart_block *restart) { struct compat_timespec __user *rmtp; struct timespec rmt; @@ -647,7 +647,7 @@ long compat_sys_clock_getres(clockid_t which_clock, return err; } -static long compat_clock_nanosleep_restart(struct restart_block *restart) +long compat_clock_nanosleep_restart(struct restart_block *restart) { long err; mm_segment_t oldfs; diff --git a/kernel/futex.c b/kernel/futex.c index e7a35f1..23419c9 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1593,16 +1593,6 @@ handle_fault: goto retry; } -/* - * In case we must use restart_block to restart a futex_wait, - * we encode in the 'flags' shared capability - */ -#define FLAGS_SHARED 0x01 -#define FLAGS_CLOCKRT 0x02 -#define FLAGS_HAS_TIMEOUT 0x04 - -static long futex_wait_restart(struct restart_block *restart); - /** * fixup_owner() - Post lock pi_state and corner case management * @uaddr: user address of the futex @@ -1876,7 +1866,7 @@ out: } -static long futex_wait_restart(struct restart_block *restart) +long futex_wait_restart(struct restart_block *restart) { u32 __user *uaddr = (u32 __user *)restart->futex.uaddr; int fshared = 0; diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index 4954407..86dcae4 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -211,7 +211,7 @@ static int no_nsleep(const clockid_t which_clock, int flags, /* * Return nonzero if we know a priori this clockid_t value is bogus. */ -static inline int invalid_clockid(const clockid_t which_clock) +int invalid_clockid(const clockid_t which_clock) { if (which_clock < 0) /* CPU clock, posix_cpu_* will check it */ return 0; -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* [C/R v20][PATCH 28/96] c/r: restart-blocks
2010-03-17 16:08 ` [C/R v20][PATCH 27/96] c/r: export functionality used in next patch for restart-blocks Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 29/96] c/r: checkpoint multiple processes Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

(Paraphrasing what's said in this message: http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used to cause a system call to be restarted with the arguments specified in the system call restart block. They are useful for system calls that are not idempotent, i.e. the argument(s) might be a relative timeout, where some adjustments are required when restarting the system call. The mechanism relies on the system call itself to set up its restart point and the argument save area. They are rare: an actual signal would turn that into an EINTR. The only case that should ever trigger this is some kernel action that interrupts the system call, but does not actually result in any user-visible state changes - like freeze and thaw.

So restart blocks are about the time remaining for the system call to sleep/wait. Generally in c/r, there are two possible time models that we can follow: absolute and relative. Here, I chose to save the relative timeout, measured from the beginning of the checkpoint. The time when the checkpoint (and restart) begins is also saved. This information is sufficient to restart in either model (absolute or relative).

Which model to use should eventually be a per-application choice (and possibly configurable via cradvise() or something of the sort). For now, we adopt the relative model, namely, at restart the timeout is set relative to the beginning of the restart.
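As a rough illustration of the relative model described above, here is a minimal userspace sketch in C. The helper names ckpt_remaining_ns() and restart_expire_ns() are hypothetical and do not appear in the patch, which does the same arithmetic inline (storing the remaining time in h->arg_4 and adding it to ctx->ktime_begin on restart). All times are assumed to be nanoseconds on the same monotonic clock.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only; the patch performs
 * this arithmetic inline rather than through functions like these.
 */

/* Checkpoint side: time left to sleep, measured from checkpoint start. */
static int64_t ckpt_remaining_ns(int64_t ckpt_begin_ns, int64_t expire_ns)
{
	/* A timeout that already expired is saved as zero, never negative. */
	return ckpt_begin_ns < expire_ns ? expire_ns - ckpt_begin_ns : 0;
}

/* Restart side (relative model): re-arm relative to the restart start. */
static int64_t restart_expire_ns(int64_t restart_begin_ns, int64_t remaining_ns)
{
	assert(remaining_ns >= 0);
	return restart_begin_ns + remaining_ns;
}
```

Under these assumptions, a task with 500ns left to sleep at checkpoint time sleeps for 500ns after the restart begins, no matter how much wall-clock time passed in between.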
To checkpoint, we check if a task has a valid restart block, and if so we save the *remaining* time that it has to wait/sleep, and the type of the restart block.

To restart, we fill in the data required at the proper place in the thread information. If the system call returns an error (which is possibly an -ERESTARTSYS, e.g.), we not only use that error as our own return value, but also arrange for the task to execute the signal handler (by faking a signal). The handler, in turn, already has the code to handle these restart requests gracefully.

Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v1]:
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c | 1 +
 checkpoint/process.c | 226 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c | 5 +-
 checkpoint/sys.c | 1 +
 include/linux/checkpoint.h | 4 +
 include/linux/checkpoint_hdr.h | 34 ++++++
 include/linux/checkpoint_types.h | 3 +
 7 files changed, 272 insertions(+), 2 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 695ab00..e25b9b7 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -22,6 +22,7 @@ #include <linux/mount.h> #include <linux/utsname.h> #include <linux/magic.h> +#include <linux/hrtimer.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> diff --git a/checkpoint/process.c b/checkpoint/process.c index f6fb9d1..9f2059c 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -12,6 +12,9 @@ #define CKPT_DFLAG CKPT_DSYS #include <linux/sched.h> +#include <linux/posix-timers.h> +#include <linux/futex.h> +#include <linux/poll.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> @@ -47,6 +50,116 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) return ckpt_write_string(ctx, t->comm, 
TASK_COMM_LEN); } +/* dump the task_struct of a given task */ +int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_restart_block *h; + struct restart_block *restart_block; + long (*fn)(struct restart_block *); + s64 base, expire = 0; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK); + if (!h) + return -ENOMEM; + + base = ktime_to_ns(ctx->ktime_begin); + restart_block = &task_thread_info(t)->restart_block; + fn = restart_block->fn; + + /* FIX: enumerate clockid_t so we're immune to changes */ + + if (fn == do_no_restart_syscall) { + + h->function_type = CKPT_RESTART_BLOCK_NONE; + ckpt_debug("restart_block: non\n"); + + } else if (fn == hrtimer_nanosleep_restart) { + + h->function_type = CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP; + h->arg_0 = restart_block->nanosleep.index; + h->arg_1 = (unsigned long) restart_block->nanosleep.rmtp; + expire = restart_block->nanosleep.expires; + ckpt_debug("restart_block: hrtimer expire %lld now %lld\n", + expire, base); + + } else if (fn == posix_cpu_nsleep_restart) { + struct timespec ts; + + h->function_type = CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP; + h->arg_0 = restart_block->arg0; + h->arg_1 = restart_block->arg1; + ts.tv_sec = restart_block->arg2; + ts.tv_nsec = restart_block->arg3; + expire = timespec_to_ns(&ts); + ckpt_debug("restart_block: posix_cpu expire %lld now %lld\n", + expire, base); + +#ifdef CONFIG_COMPAT + } else if (fn == compat_nanosleep_restart) { + + h->function_type = CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP; + h->arg_0 = restart_block->nanosleep.index; + h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp; + h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp; + expire = restart_block->nanosleep.expires; + ckpt_debug("restart_block: compat expire %lld now %lld\n", + expire, base); + + } else if (fn == compat_clock_nanosleep_restart) { + + h->function_type = CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP; + h->arg_0 = 
restart_block->nanosleep.index; + h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp; + h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp; + expire = restart_block->nanosleep.expires; + ckpt_debug("restart_block: compat_clock expire %lld now %lld\n", + expire, base); + +#endif + } else if (fn == futex_wait_restart) { + + h->function_type = CKPT_RESTART_BLOCK_FUTEX; + h->arg_0 = (unsigned long) restart_block->futex.uaddr; + h->arg_1 = restart_block->futex.val; + h->arg_2 = restart_block->futex.flags; + h->arg_3 = restart_block->futex.bitset; + expire = restart_block->futex.time; + ckpt_debug("restart_block: futex expire %lld now %lld\n", + expire, base); + + } else if (fn == do_restart_poll) { + struct timespec ts; + + h->function_type = CKPT_RESTART_BLOCK_POLL; + h->arg_0 = (unsigned long) restart_block->poll.ufds; + h->arg_1 = restart_block->poll.nfds; + h->arg_2 = restart_block->poll.has_timeout; + ts.tv_sec = restart_block->poll.tv_sec; + ts.tv_nsec = restart_block->poll.tv_nsec; + expire = timespec_to_ns(&ts); + ckpt_debug("restart_block: poll expire %lld now %lld\n", + expire, base); + + } else { + + BUG(); + + } + + /* common to all restart blocks: */ + h->arg_4 = (base < expire ? 
expire - base : 0); + + ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n", + h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4); + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + ckpt_debug("restart_block ret %d\n", ret); + return ret; +} + /* dump the entire state of a given task */ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) { @@ -62,6 +175,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) ckpt_debug("thread %d\n", ret); if (ret < 0) goto out; + ret = checkpoint_restart_block(ctx, t); + ckpt_debug("restart-blocks %d\n", ret); + if (ret < 0) + goto out; ret = checkpoint_cpu(ctx, t); ckpt_debug("cpu %d\n", ret); out: @@ -98,6 +215,111 @@ static int restore_task_struct(struct ckpt_ctx *ctx) return ret; } +int restore_restart_block(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_restart_block *h; + struct restart_block restart_block; + struct timespec ts; + clockid_t clockid; + s64 expire; + int ret = 0; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK); + if (IS_ERR(h)) + return PTR_ERR(h); + + expire = ktime_to_ns(ctx->ktime_begin) + h->arg_4; + restart_block.fn = NULL; + + ckpt_debug("restart_block: expire %lld begin %lld\n", + expire, ktime_to_ns(ctx->ktime_begin)); + ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n", + h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4); + + switch (h->function_type) { + case CKPT_RESTART_BLOCK_NONE: + restart_block.fn = do_no_restart_syscall; + break; + case CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP: + clockid = h->arg_0; + if (clockid < 0 || invalid_clockid(clockid)) + break; + restart_block.fn = hrtimer_nanosleep_restart; + restart_block.nanosleep.index = clockid; + restart_block.nanosleep.rmtp = + (struct timespec __user *) (unsigned long) h->arg_1; + restart_block.nanosleep.expires = expire; + break; + case CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP: + clockid = h->arg_0; + if (clockid < 0 || invalid_clockid(clockid)) + break; + 
restart_block.fn = posix_cpu_nsleep_restart; + restart_block.arg0 = clockid; + restart_block.arg1 = h->arg_1; + ts = ns_to_timespec(expire); + restart_block.arg2 = ts.tv_sec; + restart_block.arg3 = ts.tv_nsec; + break; +#ifdef CONFIG_COMPAT + case CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP: + clockid = h->arg_0; + if (clockid < 0 || invalid_clockid(clockid)) + break; + restart_block.fn = compat_nanosleep_restart; + restart_block.nanosleep.index = clockid; + restart_block.nanosleep.rmtp = + (struct timespec __user *) (unsigned long) h->arg_1; + restart_block.nanosleep.compat_rmtp = + (struct compat_timespec __user *) + (unsigned long) h->arg_2; + restart_block.nanosleep.expires = expire; + break; + case CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP: + clockid = h->arg_0; + if (clockid < 0 || invalid_clockid(clockid)) + break; + restart_block.fn = compat_clock_nanosleep_restart; + restart_block.nanosleep.index = clockid; + restart_block.nanosleep.rmtp = + (struct timespec __user *) (unsigned long) h->arg_1; + restart_block.nanosleep.compat_rmtp = + (struct compat_timespec __user *) + (unsigned long) h->arg_2; + restart_block.nanosleep.expires = expire; + break; +#endif + case CKPT_RESTART_BLOCK_FUTEX: + restart_block.fn = futex_wait_restart; + restart_block.futex.uaddr = (u32 *) (unsigned long) h->arg_0; + restart_block.futex.val = h->arg_1; + restart_block.futex.flags = h->arg_2; + restart_block.futex.bitset = h->arg_3; + restart_block.futex.time = expire; + break; + case CKPT_RESTART_BLOCK_POLL: + restart_block.fn = do_restart_poll; + restart_block.poll.ufds = + (struct pollfd __user *) (unsigned long) h->arg_0; + restart_block.poll.nfds = h->arg_1; + restart_block.poll.has_timeout = h->arg_2; + ts = ns_to_timespec(expire); + restart_block.poll.tv_sec = ts.tv_sec; + restart_block.poll.tv_nsec = ts.tv_nsec; + break; + default: + break; + } + + if (restart_block.fn) + task_thread_info(current)->restart_block = restart_block; + else + ret = -EINVAL; + + ckpt_hdr_put(ctx, h); + 
return ret; +} + /* read the entire state of the current task */ int restore_task(struct ckpt_ctx *ctx) { @@ -111,6 +333,10 @@ int restore_task(struct ckpt_ctx *ctx) ckpt_debug("thread %d\n", ret); if (ret < 0) goto out; + ret = restore_restart_block(ctx); + ckpt_debug("restart-blocks %d\n", ret); + if (ret < 0) + goto out; ret = restore_cpu(ctx); ckpt_debug("cpu %d\n", ret); out: diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 11d9738..360c41e 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -16,6 +16,8 @@ #include <linux/file.h> #include <linux/magic.h> #include <linux/utsname.h> +#include <asm/syscall.h> +#include <linux/elf.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> @@ -523,6 +525,5 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid) if (ret < 0) return ret; - /* on success, adjust the return value if needed [TODO] */ - return restore_retval(ctx); + return restore_retval(); } diff --git a/checkpoint/sys.c b/checkpoint/sys.c index 308cd27..d858096 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -198,6 +198,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, ctx->uflags = uflags; ctx->kflags = kflags; + ctx->ktime_begin = ktime_get(); mutex_init(&ctx->msg_mutex); diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 3095431..8cb6130 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -77,6 +77,10 @@ extern int restore_read_header_arch(struct ckpt_ctx *ctx); extern int restore_thread(struct ckpt_ctx *ctx); extern int restore_cpu(struct ckpt_ctx *ctx); +extern int checkpoint_restart_block(struct ckpt_ctx *ctx, + struct task_struct *t); +extern int restore_restart_block(struct ckpt_ctx *ctx); + static inline int ckpt_validate_errno(int errno) { return (errno >= 0) && (errno < MAX_ERRNO); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 4627564..24e880f 100644 --- a/include/linux/checkpoint_hdr.h +++ 
b/include/linux/checkpoint_hdr.h @@ -67,6 +67,8 @@ enum { CKPT_HDR_TASK = 101, #define CKPT_HDR_TASK CKPT_HDR_TASK + CKPT_HDR_RESTART_BLOCK, +#define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK CKPT_HDR_THREAD, #define CKPT_HDR_THREAD CKPT_HDR_THREAD CKPT_HDR_CPU, @@ -147,4 +149,36 @@ struct ckpt_hdr_task { __u64 clear_child_tid; } __attribute__((aligned(8))); +/* restart blocks */ +struct ckpt_hdr_restart_block { + struct ckpt_hdr h; + __u64 function_type; + __u64 arg_0; + __u64 arg_1; + __u64 arg_2; + __u64 arg_3; + __u64 arg_4; +} __attribute__((aligned(8))); + +enum restart_block_type { + CKPT_RESTART_BLOCK_NONE = 1, +#define CKPT_RESTART_BLOCK_NONE CKPT_RESTART_BLOCK_NONE + CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP, +#define CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP \ + CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP + CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP, +#define CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP \ + CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP + CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP, +#define CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP \ + CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP + CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP, +#define CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP \ + CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP + CKPT_RESTART_BLOCK_POLL, +#define CKPT_RESTART_BLOCK_POLL CKPT_RESTART_BLOCK_POLL + CKPT_RESTART_BLOCK_FUTEX, +#define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX +}; + #endif /* _CHECKPOINT_CKPT_HDR_H_ */ diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h index dc35b21..6420a3b 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -15,10 +15,13 @@ #include <linux/sched.h> #include <linux/nsproxy.h> #include <linux/fs.h> +#include <linux/ktime.h> struct ckpt_ctx { int crid; /* unique checkpoint id */ + ktime_t ktime_begin; /* checkpoint start time */ + pid_t root_pid; /* [container] root pid */ struct task_struct *root_task; /* [container] root task */ struct nsproxy *root_nsproxy; /* [container] root 
nsproxy */ -- 1.6.3.3
* [C/R v20][PATCH 29/96] c/r: checkpoint multiple processes
2010-03-17 16:08 ` [C/R v20][PATCH 28/96] c/r: restart-blocks Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 30/96] c/r: restart " Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

Checkpointing of multiple processes works by recording the tasks tree structure below a given "root" task. The root task is expected to be a container init, and then an entire container is checkpointed. However, passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement and allows checkpointing a subtree of processes starting from the root task.

For a given root task, do a DFS scan of the tasks tree and collect the tasks into an array (keeping a reference to each task). Using DFS simplifies the recreation of tasks either in user space or kernel space. For each task collected, test if it can be checkpointed, and save its pid, tgid, and ppid.

The actual work is divided into two passes: a first scan counts the tasks, then memory is allocated and a second scan fills the array.

Whether checkpoint and restart require CAP_SYS_ADMIN is determined by the sysctl 'ckpt_unpriv_allowed': if 0, both require CAP_SYS_ADMIN, which prevents unprivileged users from exploiting any privilege escalation bugs; if 1 (the default), only restart requires it; if 2, neither does, and regular permission checks are intended to prevent privilege escalation.

The logic is suitable for creation of processes during restart either in userspace or by the kernel.

Currently we ignore threads and zombies.
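The two-pass count-then-fill scheme described above can be sketched in plain userspace C. This is only an illustration under simplifying assumptions: struct toy_task, toy_walk() and toy_build_tree() are hypothetical names, the walk is recursive (the patch's walk_task_subtree() is iterative and runs under tasklist_lock), and no task references or permission checks are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for task_struct: first child plus next sibling links. */
struct toy_task {
	int pid;
	struct toy_task *first_child;
	struct toy_task *next_sibling;
};

/*
 * Pre-order DFS. With arr == NULL only count the tasks; otherwise also
 * store them, failing with -1 if the tree grew beyond 'cap' entries.
 */
static int toy_walk(struct toy_task *t, struct toy_task **arr, int cap, int *nr)
{
	struct toy_task *c;

	if (arr) {
		if (*nr == cap)
			return -1;
		arr[*nr] = t;
	}
	(*nr)++;

	for (c = t->first_child; c; c = c->next_sibling)
		if (toy_walk(c, arr, cap, nr) < 0)
			return -1;
	return 0;
}

/* Pass 1 counts, then allocate, then pass 2 fills; the counts must agree. */
static struct toy_task **toy_build_tree(struct toy_task *root, int *nr_out)
{
	struct toy_task **arr;
	int n = 0, m = 0;

	assert(root != NULL);
	toy_walk(root, NULL, 0, &n);

	arr = calloc(n, sizeof(*arr));
	if (!arr)
		return NULL;

	if (toy_walk(root, arr, n, &m) < 0 || m != n) {
		free(arr);
		return NULL;
	}
	*nr_out = n;
	return arr;
}
```

Because the array is filled in pre-order, every parent appears before its children, which is what makes re-creating the tasks in array order straightforward at restart.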
Changelog[v20]: - [Serge Hallyn] Change sysctl and default for unprivileged use Changelog[v19-rc3]: - Rebase to kernel 2.6.33 (fix sysctl entry for ckpt_unpriv_allowed) Changelog[v19-rc1]: - Introduce walk_task_subtree() to iterate through descendants - [Matt Helsley] Add cpp definitions for enums - [Serge Hallyn] Add global section container to image format Changelog[v18]: - Replace some EAGAIN with EBUSY - Add a few more ckpt_write_err()s - Rename headerless struct ckpt_hdr_* to struct ckpt_* Changelog[v16]: - CHECKPOINT_SUBTREE flags allows subtree (not whole container) - sysctl variable 'ckpt_unpriv_allowed' controls needed privileges Changelog[v14]: - Refuse non-self checkpoint if target task isn't frozen - Refuse checkpoint (for now) if task is ptraced - Revert change to pr_debug(), back to ckpt_debug() - Use only unsigned fields in checkpoint headers - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree() - Discard 'h.parent' field - Check whether calls to ckpt_hbuf_get() fail - Disallow threads or siblings to container init Changelog[v13]: - Release tasklist_lock in error path in ckpt_tree_count_tasks() - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids() Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. 
Hallyn <serue@us.ibm.com> --- checkpoint/checkpoint.c | 271 ++++++++++++++++++++++++++++++++++++-- checkpoint/restart.c | 2 +- checkpoint/sys.c | 108 +++++++++++++++- include/linux/checkpoint.h | 10 ++ include/linux/checkpoint_hdr.h | 18 +++- include/linux/checkpoint_types.h | 4 + kernel/sysctl.c | 16 +++ 7 files changed, 411 insertions(+), 18 deletions(-) diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index e25b9b7..ba566b0 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -197,8 +197,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx) return ret; } +/* dump all tasks in ctx->tasks_arr[] */ +static int checkpoint_all_tasks(struct ckpt_ctx *ctx) +{ + int n, ret = 0; + + for (n = 0; n < ctx->nr_tasks; n++) { + ckpt_debug("dumping task #%d\n", n); + ret = checkpoint_task(ctx, ctx->tasks_arr[n]); + if (ret < 0) + break; + } + + return ret; +} + static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) { + struct task_struct *root = ctx->root_task; + + ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns)); + if (t->state == TASK_DEAD) { _ckpt_err(ctx, -EBUSY, "%(T)Task state EXIT_DEAD\n"); return -EBUSY; @@ -221,15 +240,234 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) return -EBUSY; } + /* + * FIX: for now, disallow siblings of container init created + * via CLONE_PARENT (unclear if they will remain possible) + */ + if (ctx->root_init && t != root && t->tgid != root->tgid && + t->real_parent == root->real_parent) { + _ckpt_err(ctx, -EINVAL, "%(T)Task is sibling of root\n"); + return -EINVAL; + } + + /* FIX: change this when namespaces are added */ + if (task_nsproxy(t) != ctx->root_nsproxy) + return -EPERM; + return 0; } +#define CKPT_HDR_PIDS_CHUNK 256 + +static int checkpoint_pids(struct ckpt_ctx *ctx) +{ + struct ckpt_pids *h; + struct pid_namespace *ns; + struct task_struct *task; + struct task_struct **tasks_arr; + int nr_tasks, n, pos = 0, ret = 0; + 
+ ns = ctx->root_nsproxy->pid_ns; + tasks_arr = ctx->tasks_arr; + nr_tasks = ctx->nr_tasks; + BUG_ON(nr_tasks <= 0); + + ret = ckpt_write_obj_type(ctx, NULL, + sizeof(*h) * nr_tasks, + CKPT_HDR_BUFFER); + if (ret < 0) + return ret; + + h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK); + if (!h) + return -ENOMEM; + + do { + rcu_read_lock(); + for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) { + task = tasks_arr[pos]; + + h[n].vpid = task_pid_nr_ns(task, ns); + h[n].vtgid = task_tgid_nr_ns(task, ns); + h[n].vpgid = task_pgrp_nr_ns(task, ns); + h[n].vsid = task_session_nr_ns(task, ns); + h[n].vppid = task_tgid_nr_ns(task->real_parent, ns); + ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n", + pos, h[n].vpid, h[n].vtgid, h[n].vppid); + pos++; + } + rcu_read_unlock(); + + n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK); + ret = ckpt_kwrite(ctx, h, n * sizeof(*h)); + if (ret < 0) + break; + + nr_tasks -= n; + } while (nr_tasks > 0); + + _ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK); + return ret; +} + +struct ckpt_cnt_tasks { + struct ckpt_ctx *ctx; + int nr; +}; + +/* count number of tasks in tree (and optionally fill pid's in array) */ +static int __tree_count_tasks(struct task_struct *task, void *data) +{ + struct ckpt_cnt_tasks *d = (struct ckpt_cnt_tasks *) data; + struct ckpt_ctx *ctx = d->ctx; + int ret; + + ctx->tsk = task; /* (for _ckpt_err()) */ + + /* is this task cool ? */ + ret = may_checkpoint_task(ctx, task); + if (ret < 0) + goto out; + + if (ctx->tasks_arr) { + if (d->nr == ctx->nr_tasks) { /* unlikely... 
try again later */ + _ckpt_err(ctx, -EBUSY, "%(T)Bad task count (%d)\n", + d->nr); + ret = -EBUSY; + goto out; + } + ctx->tasks_arr[d->nr++] = task; + get_task_struct(task); + } + + ret = 1; + out: + ctx->tsk = NULL; + return ret; +} + +static int tree_count_tasks(struct ckpt_ctx *ctx) +{ + struct ckpt_cnt_tasks data; + int ret; + + data.ctx = ctx; + data.nr = 0; + + ckpt_msg_lock(ctx); + ret = walk_task_subtree(ctx->root_task, __tree_count_tasks, &data); + ckpt_msg_unlock(ctx); + if (ret < 0) + _ckpt_msg_complete(ctx); + return ret; +} + +/* + * build_tree - scan the tasks tree in DFS order and fill in array + * @ctx: checkpoint context + * + * Using DFS order simplifies the restart logic to re-create the tasks. + * + * On success, ctx->tasks_arr will be allocated and populated with all + * tasks (reference taken), and ctx->nr_tasks will hold the total count. + * The array is cleaned up by ckpt_ctx_free(). + */ +static int build_tree(struct ckpt_ctx *ctx) +{ + int n, m; + + /* count tasks (no side effects) */ + n = tree_count_tasks(ctx); + if (n < 0) + return n; + + ctx->nr_tasks = n; + ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL); + if (!ctx->tasks_arr) + return -ENOMEM; + + /* count again (now will fill array) */ + m = tree_count_tasks(ctx); + + /* unlikely, but ... (cleanup in ckpt_ctx_free) */ + if (m < 0) + return m; + else if (m != n) + return -EBUSY; + + return 0; +} + +/* dump the array that describes the tasks tree */ +static int checkpoint_tree(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_tree *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TREE); + if (!h) + return -ENOMEM; + + h->nr_tasks = ctx->nr_tasks; + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret < 0) + return ret; + + ret = checkpoint_pids(ctx); + return ret; +} + +static struct task_struct *get_freezer_task(struct task_struct *root_task) +{ + struct task_struct *p; + + /* + * For the duration of checkpoint we deep-freeze all tasks. 
+ * Normally do it through the root task's freezer cgroup. + * However, if the root task is also the current task (doing + * self-checkpoint) we can't freeze ourselves. In this case, + * choose the next available (non-dead) task instead. We'll + * use its freezer cgroup to verify that all tasks belong to + * the same cgroup. + */ + + if (root_task != current) { + get_task_struct(root_task); + return root_task; + } + + /* search among threads, then children */ + read_lock(&tasklist_lock); + + for (p = next_thread(root_task); p != root_task; p = next_thread(p)) { + if (p->state == TASK_DEAD) + continue; + if (!in_same_cgroup_freezer(p, root_task)) + goto out; + } + + list_for_each_entry(p, &root_task->children, sibling) { + if (p->state == TASK_DEAD) + continue; + if (!in_same_cgroup_freezer(p, root_task)) + goto out; + } + + p = NULL; + out: + read_unlock(&tasklist_lock); + if (p) + get_task_struct(p); + return p; +} + /* setup checkpoint-specific parts of ctx */ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) { struct task_struct *task; struct nsproxy *nsproxy; - int ret; /* * No need for explicit cleanup here, because if an error @@ -261,18 +499,14 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) ctx->root_nsproxy = nsproxy; /* root freezer */ - ctx->root_freezer = task; - get_task_struct(task); + ctx->root_freezer = get_freezer_task(task); - ret = may_checkpoint_task(ctx, task); - if (ret) { - _ckpt_msg_complete(ctx); - put_task_struct(task); - put_task_struct(task); - put_nsproxy(nsproxy); - ctx->root_nsproxy = NULL; - ctx->root_task = NULL; - return ret; + /* container init ? 
*/ + ctx->root_init = is_container_init(task); + + if (!(ctx->uflags & CHECKPOINT_SUBTREE) && !ctx->root_init) { + ckpt_err(ctx, -EINVAL, "Not container init\n"); + return -EINVAL; /* cleanup by ckpt_ctx_free() */ } return 0; @@ -288,17 +522,26 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) if (ctx->root_freezer) { ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer); - if (ret < 0) + if (ret < 0) { + ckpt_err(ctx, ret, "Freezer cgroup failed\n"); return ret; + } } + ret = build_tree(ctx); + if (ret < 0) + goto out; + ret = checkpoint_write_header(ctx); if (ret < 0) goto out; ret = checkpoint_container(ctx); if (ret < 0) goto out; - ret = checkpoint_task(ctx, ctx->root_task); + ret = checkpoint_tree(ctx); + if (ret < 0) + goto out; + ret = checkpoint_all_tasks(ctx); if (ret < 0) goto out; ret = checkpoint_write_tail(ctx); diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 360c41e..3e898e7 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -382,7 +382,7 @@ static int restore_read_header(struct ckpt_ctx *ctx) ckpt_err(ctx, ret, "incompatible kernel version"); goto out; } - if (h->uflags) { + if (h->uflags & ~CHECKPOINT_USER_FLAGS) { ckpt_err(ctx, ret, "incompatible restart user flags"); goto out; } diff --git a/checkpoint/sys.c b/checkpoint/sys.c index d858096..d0eed25 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -23,6 +23,16 @@ #include <linux/checkpoint.h> /* + * ckpt_unpriv_allowed - sysctl controlled. + * If 0, then caller of sys_checkpoint() or sys_restart() must have + * CAP_SYS_ADMIN + * If 1, then only sys_restart() requires CAP_SYS_ADMIN. + * If 2, then both can be called without privilege - regular permissions + * checks are intended to do the job. + */ +int ckpt_unpriv_allowed = 1; /* default: unpriv checkpoint not restart */ + +/* * Helpers to write(read) from(to) kernel space to(from) the checkpoint * image file descriptor (similar to how a core-dump is performed). 
* @@ -169,6 +179,19 @@ EXPORT_SYMBOL(ckpt_hdr_get_type); * restart operation, and persists until the operation is completed. */ +static void task_arr_free(struct ckpt_ctx *ctx) +{ + int n; + + for (n = 0; n < ctx->nr_tasks; n++) { + if (ctx->tasks_arr[n]) { + put_task_struct(ctx->tasks_arr[n]); + ctx->tasks_arr[n] = NULL; + } + } + kfree(ctx->tasks_arr); +} + static void ckpt_ctx_free(struct ckpt_ctx *ctx) { if (ctx->file) @@ -176,6 +199,9 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx) if (ctx->logfile) fput(ctx->logfile); + if (ctx->tasks_arr) + task_arr_free(ctx); + if (ctx->root_nsproxy) put_nsproxy(ctx->root_nsproxy); if (ctx->root_task) @@ -417,6 +443,79 @@ void do_ckpt_msg(struct ckpt_ctx *ctx, int err, char *fmt, ...) } EXPORT_SYMBOL(do_ckpt_msg); +/** + * walk_task_subtree: iterate through a task's descendants + * @root: subtree root task + * @func: callback invoked on each task + * @data: pointer passed to the callback + * + * The function will start with @root, and iterate through all the + * descendants, including threads, in a DFS manner. Children of a task + * are traversed before proceeding to the next thread of that task. + * + * For each task, the callback @func will be called providing the task + * pointer and the @data. The callback is invoked while holding the + * tasklist_lock for reading. If the callback fails it should return a + * negative error, and the traversal ends. If the callback succeeds, + * it returns a non-negative number, and these values are summed. + * + * On success, walk_task_subtree() returns the total summed. On + * failure, it returns a negative value. 
+ */ +int walk_task_subtree(struct task_struct *root, + int (*func)(struct task_struct *, void *), + void *data) +{ + + struct task_struct *leader = root; + struct task_struct *parent = NULL; + struct task_struct *task = root; + int total = 0; + int ret; + + read_lock(&tasklist_lock); + while (1) { + /* invoke callback on this task */ + ret = func(task, data); + if (ret < 0) + break; + + total += ret; + + /* if has children - proceed with child */ + if (!list_empty(&task->children)) { + parent = task; + task = list_entry(task->children.next, + struct task_struct, sibling); + continue; + } + + while (task != root) { + /* if has sibling - proceed with sibling */ + if (!list_is_last(&task->sibling, &parent->children)) { + task = list_entry(task->sibling.next, + struct task_struct, sibling); + break; + } + + /* else, trace back to parent and proceed */ + task = parent; + parent = parent->real_parent; + } + + if (task == root) { + /* in case root task is multi-threaded */ + root = task = next_thread(task); + if (root == leader) + break; + } + } + read_unlock(&tasklist_lock); + + ckpt_debug("total %d ret %d\n", total, ret); + return (ret < 0 ? 
ret : total); +} + /* checkpoint/restart syscalls */ /** @@ -434,10 +533,12 @@ long do_sys_checkpoint(pid_t pid, int fd, unsigned long flags, int logfd) struct ckpt_ctx *ctx; long ret; - /* no flags for now */ - if (flags) + if (flags & ~CHECKPOINT_USER_FLAGS) return -EINVAL; + if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN)) + return -EPERM; + if (pid == 0) pid = task_pid_vnr(current); ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT, logfd); @@ -472,6 +573,9 @@ long do_sys_restart(pid_t pid, int fd, unsigned long flags, int logfd) if (flags) return -EINVAL; + if (ckpt_unpriv_allowed < 2 && !capable(CAP_SYS_ADMIN)) + return -EPERM; + ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART, logfd); if (IS_ERR(ctx)) return PTR_ERR(ctx); diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 8cb6130..30f5353 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -12,6 +12,9 @@ #define CHECKPOINT_VERSION 3 +/* checkpoint user flags */ +#define CHECKPOINT_SUBTREE 0x1 + /* misc user visible */ #define CHECKPOINT_FD_NONE -1 @@ -35,6 +38,13 @@ extern long do_sys_restart(pid_t pid, int fd, #define CKPT_CTX_CHECKPOINT (1 << CKPT_CTX_CHECKPOINT_BIT) #define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT) +/* ckpt_ctx: uflags */ +#define CHECKPOINT_USER_FLAGS CHECKPOINT_SUBTREE + + +extern int walk_task_subtree(struct task_struct *task, + int (*func)(struct task_struct *, void *), + void *data); extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count); extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 24e880f..083f5d3 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -65,7 +65,9 @@ enum { CKPT_HDR_STRING, #define CKPT_HDR_STRING CKPT_HDR_STRING - CKPT_HDR_TASK = 101, + CKPT_HDR_TREE = 101, +#define CKPT_HDR_TREE CKPT_HDR_TREE + CKPT_HDR_TASK, #define CKPT_HDR_TASK CKPT_HDR_TASK 
CKPT_HDR_RESTART_BLOCK, #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK @@ -137,6 +139,20 @@ struct ckpt_hdr_container { struct ckpt_hdr h; } __attribute__((aligned(8)));; +/* task tree */ +struct ckpt_hdr_tree { + struct ckpt_hdr h; + __s32 nr_tasks; +} __attribute__((aligned(8))); + +struct ckpt_pids { + __s32 vpid; + __s32 vppid; + __s32 vtgid; + __s32 vpgid; + __s32 vsid; +} __attribute__((aligned(8))); + /* task data */ struct ckpt_hdr_task { struct ckpt_hdr h; diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h index 6420a3b..a66c603 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -22,6 +22,7 @@ struct ckpt_ctx { ktime_t ktime_begin; /* checkpoint start time */ + int root_init; /* [container] root init ? */ pid_t root_pid; /* [container] root pid */ struct task_struct *root_task; /* [container] root task */ struct nsproxy *root_nsproxy; /* [container] root nsproxy */ @@ -35,6 +36,9 @@ struct ckpt_ctx { struct file *logfile; /* status/debug log file */ loff_t total; /* total read/written */ + struct task_struct **tasks_arr; /* array of all tasks in container */ + int nr_tasks; /* size of tasks array */ + struct task_struct *tsk;/* checkpoint: current target task */ char err_string[256]; /* checkpoint: error string */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8a68b24..8443bb0 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -204,6 +204,10 @@ int sysctl_legacy_va_layout; extern int prove_locking; extern int lock_stat; +#ifdef CONFIG_CHECKPOINT +extern int ckpt_unpriv_allowed; +#endif + /* The default sysctl tables: */ static struct ctl_table root_table[] = { @@ -936,6 +940,18 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif +#ifdef CONFIG_CHECKPOINT + { + .procname = "ckpt_unpriv_allowed", + .data = &ckpt_unpriv_allowed, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &two, 
+ }, +#endif + /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt -- 1.6.3.3
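As a reading aid for the walk_task_subtree() hunk in this patch: the
traversal visits a task, descends to its first child when there is one,
and otherwise climbs back through parents until a sibling is found. The
following is a minimal user-space sketch of that pre-order walk, not the
kernel code. The names struct node, walk_subtree(), record() and
demo_walk() are hypothetical and introduced only for illustration; the
sketch uses explicit first_child/next_sibling pointers in place of the
kernel's list_head, and it leaves out tasklist_lock and the thread-group
step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct task_struct. */
struct node {
	struct node *parent;       /* NULL for the root of the whole tree */
	struct node *first_child;  /* NULL if this node is a leaf */
	struct node *next_sibling; /* NULL if this is the last child */
	int id;
};

/*
 * Pre-order walk of the subtree under @root without recursion or an
 * explicit stack. As in the patch, @func returns a negative value to
 * abort the walk, otherwise a count that is added to the total.
 */
static int walk_subtree(struct node *root,
			int (*func)(struct node *, void *), void *data)
{
	struct node *n = root;
	int total = 0;
	int ret;

	for (;;) {
		/* invoke callback on this node */
		ret = func(n, data);
		if (ret < 0)
			return ret;
		total += ret;

		/* if it has children, proceed with the first child */
		if (n->first_child) {
			n = n->first_child;
			continue;
		}

		/* otherwise climb until a node with a sibling is found */
		while (n != root && !n->next_sibling)
			n = n->parent;
		if (n == root)
			break;	/* back at the top: the walk is complete */
		n = n->next_sibling;
	}
	return total;
}

/* Records the visiting order so the walk can be checked. */
struct trace {
	int ids[8];
	int len;
};

static int record(struct node *n, void *data)
{
	struct trace *t = data;

	assert(t->len < 8);
	t->ids[t->len++] = n->id;
	return 1;	/* count this node */
}

/* Builds the tree 1 -> (2 -> (4, 5), 3) and walks it from node 1. */
static int demo_walk(struct trace *t)
{
	static struct node n1, n2, n3, n4, n5;

	n1 = (struct node){ NULL, &n2, NULL, 1 };
	n2 = (struct node){ &n1, &n4, &n3, 2 };
	n3 = (struct node){ &n1, NULL, NULL, 3 };
	n4 = (struct node){ &n2, NULL, &n5, 4 };
	n5 = (struct node){ &n2, NULL, NULL, 5 };

	t->len = 0;
	return walk_subtree(&n1, record, t);
}
```

Calling demo_walk() with a struct trace fills in the order in which the
nodes were visited and returns the sum of the callback return values,
which is how prepare_descendants() in the next patch obtains its task
count from walk_task_subtree().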
* [C/R v20][PATCH 30/96] c/r: restart multiple processes
2010-03-17 16:08 ` [C/R v20][PATCH 29/96] c/r: checkpoint multiple processes Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 31/96] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task restores itself
in the same order in which the tasks were saved. The internals of the
syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do that.

There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart context
(ctx) and orchestrates the restart. Thus even if a restart fails after,
or during the restore of the root task, the user perceives a clean exit
and an error message.

The coordinator task will:
1) read header and tree, create @ctx (wake up restarting tasks)
2) set the ->checkpoint_ctx field of itself and all descendants
3) wait for all restarting tasks to reach sync point #1
4) activate first restarting task (root task)
5) wait for all other tasks to complete and reach sync point #3
6) wake up everybody

(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')

Restarting tasks have three sync points:
1) wait for its ->checkpoint_ctx to be set (by the coordinator)
2) wait for the task's turn to restore (be active)
[...now the task restores its state...]
3) wait for all other tasks to complete

The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occurred). This prevents tasks from returning to user space
prematurely, before the entire restart completes.

If a single task wishes to restart, it can pass the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.

The root-task is a child of the coordinator, identified by the @pid
given to sys_restart() in the pid-ns of the coordinator. Restarting
tasks that aren't the coordinator should set the @pid argument of the
restart(2) syscall to zero.

All tasks explicitly test for an error flag on the checkpoint context
when they wake up from sync points. If an error occurs during the
restart of some task, it will mark the @ctx with an error flag, and
wake up the other tasks.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (@ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In that case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container.
However, it is also possible to restart applications not necessarily inside a container, and without restoring the original pids of the processes (that is, provided that the application can tolerate such behavior). This is useful to allow multi-process restart of tasks not isolated inside a container, and also for debugging. Changelog[v20]: - Replace error_sem with an event completion Changelog[v19-rc3]: - Rebase to kernel 2.6.33 - Call restore_notify_error for restart (not checkpoint !) - Make kread/kwrite() abort if CKPT_CTX_ERROR is set Changelog[v19-rc1]: - [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc - Pull cleanup/debug code from patches zombie, pgid to here - Simplify logic of tracking restarting tasks (->ctx) - Use walk_task_subtree() to iterate through descendants - Coordinator kills descendants on failure for proper cleanup - Prepare descendants needs PTRACE_MODE_ATTACH permissions - Threads wait for entire thread group before restoring - Add debug process-tree status during restart - Fix handling of bogus pid arg to sys_restart - [Serge Hallyn] Add global section container to image format - Coordinator to report correct error on restart failure Changelog[v18]: - Fix race of prepare_descendant() with an ongoing fork() - Track and report the first error if restart fails - Tighten logic to protect against bogus pids in input - [Matt Helsley] Improve debug output from ckpt_notify_error() Changelog[v17]: - Add uflag RESTART_FROZEN to freeze tasks after restart - Fix restore_retval() and use only for restarting tasks - Coordinator converts -ERSTART... 
to -EINTR - Coordinator marks and sets descendants' ->checkpoint_ctx - Coordinator properly detects errors when woken up from wait - Fix race where root_task could kick start too early - Add a sync point for restarting tasks - Multiple fixes to restart logic Changelog[v14]: - Revert change to pr_debug(), back to ckpt_debug() - Discard field 'h.parent' - Check whether calls to ckpt_hbuf_get() fail Changelog[v13]: - Clear root_task->checkpoint_ctx regardless of error condition - Remove unused argument 'ctx' from do_restore_task() prototype - Remove unused member 'pids_err' from 'struct ckpt_ctx' Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- checkpoint/checkpoint.c | 5 + checkpoint/restart.c | 759 ++++++++++++++++++++++++++++++++++++-- checkpoint/sys.c | 72 +++- include/linux/checkpoint.h | 44 +++- include/linux/checkpoint_types.h | 24 ++- include/linux/sched.h | 10 +- kernel/exit.c | 5 + kernel/fork.c | 7 + 8 files changed, 888 insertions(+), 38 deletions(-) diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index ba566b0..1e38ae3 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -552,6 +552,11 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) ctx->crid = atomic_inc_return(&ctx_count); ret = ctx->crid; out: + if (ret < 0) + ckpt_set_error(ctx, ret); + else + ckpt_set_success(ctx); + if (ctx->root_freezer) cgroup_freezer_end_checkpoint(ctx->root_freezer); return ret; diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 3e898e7..59c4bd8 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -13,7 +13,10 @@ #include <linux/version.h> #include <linux/sched.h> +#include <linux/wait.h> #include <linux/file.h> +#include <linux/ptrace.h> +#include <linux/freezer.h> #include <linux/magic.h> #include <linux/utsname.h> #include <asm/syscall.h> @@ -21,6 +24,169 
@@ #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> +#define RESTART_DBG_ROOT (1 << 0) +#define RESTART_DBG_GHOST (1 << 1) +#define RESTART_DBG_COORD (1 << 2) +#define RESTART_DBG_TASK (1 << 3) +#define RESTART_DBG_WAITING (1 << 4) +#define RESTART_DBG_RUNNING (1 << 5) +#define RESTART_DBG_EXITED (1 << 6) +#define RESTART_DBG_FAILED (1 << 7) +#define RESTART_DBG_SUCCESS (1 << 8) + +#ifdef CONFIG_CHECKPOINT_DEBUG + +/* + * Track status of restarting tasks in a list off of checkpoint_ctx. + * Print this info when the checkpoint_ctx is freed. Sample output: + * + * [3519:2:c/r:debug_task_status:207] 3 tasks registered, nr_tasks was 0 nr_total 0 + * [3519:2:c/r:debug_task_status:210] active pid was 1, ctx->errno 0 + * [3519:2:c/r:debug_task_status:212] kflags 6 uflags 0 oflags 1 + * [3519:2:c/r:debug_task_status:214] task 0 to run was 2 + * [3519:2:c/r:debug_task_status:217] pid 3517 C r + * [3519:2:c/r:debug_task_status:217] pid 3519 RN + * [3519:2:c/r:debug_task_status:217] pid 3520 G + */ + +struct ckpt_task_status { + pid_t pid; + int flags; + int error; + struct list_head list; +}; + +static int restore_debug_task(struct ckpt_ctx *ctx, int flags) +{ + struct ckpt_task_status *s; + + s = kmalloc(sizeof(*s), GFP_KERNEL); + if (!s) { + ckpt_debug("no memory to register ?!\n"); + return -ENOMEM; + } + s->pid = current->pid; + s->error = 0; + s->flags = RESTART_DBG_WAITING | flags; + if (current == ctx->root_task) + s->flags |= RESTART_DBG_ROOT; + + spin_lock(&ctx->lock); + list_add_tail(&s->list, &ctx->task_status); + spin_unlock(&ctx->lock); + + return 0; +} + +static struct ckpt_task_status *restore_debug_getme(struct ckpt_ctx *ctx) +{ + struct ckpt_task_status *s; + + spin_lock(&ctx->lock); + list_for_each_entry(s, &ctx->task_status, list) { + if (s->pid == current->pid) { + spin_unlock(&ctx->lock); + return s; + } + } + spin_unlock(&ctx->lock); + return NULL; +} + +static void restore_debug_error(struct ckpt_ctx *ctx, int err) +{ + struct 
ckpt_task_status *s = restore_debug_getme(ctx); + + s->error = err; + s->flags &= ~RESTART_DBG_WAITING; + s->flags &= ~RESTART_DBG_RUNNING; + if (err) + s->flags |= RESTART_DBG_FAILED; + else + s->flags |= RESTART_DBG_SUCCESS; +} + +static void restore_debug_running(struct ckpt_ctx *ctx) +{ + struct ckpt_task_status *s = restore_debug_getme(ctx); + + s->flags &= ~RESTART_DBG_WAITING; + s->flags |= RESTART_DBG_RUNNING; +} + +static void restore_debug_exit(struct ckpt_ctx *ctx) +{ + struct ckpt_task_status *s = restore_debug_getme(ctx); + + s->flags &= ~RESTART_DBG_WAITING; + s->flags |= RESTART_DBG_EXITED; +} + +void restore_debug_free(struct ckpt_ctx *ctx) +{ + struct ckpt_task_status *s, *p; + int i, count = 0; + char *which, *state; + + /* + * See how many tasks registered. Tasks which didn't reach + * sys_restart() won't have registered. So if this count is + * not the same as ctx->nr_total, that's a warning bell + */ + list_for_each_entry(s, &ctx->task_status, list) + count++; + ckpt_debug("%d tasks registered, nr_tasks was %d nr_total %d\n", + count, ctx->nr_tasks, atomic_read(&ctx->nr_total)); + + ckpt_debug("active pid was %d, ctx->errno %d\n", ctx->active_pid, + ctx->errno); + ckpt_debug("kflags %lu uflags %lu oflags %lu", ctx->kflags, + ctx->uflags, ctx->oflags); + for (i = 0; i < ctx->nr_pids; i++) + ckpt_debug("task[%d] to run %d\n", i, ctx->pids_arr[i].vpid); + + list_for_each_entry_safe(s, p, &ctx->task_status, list) { + if (s->flags & RESTART_DBG_COORD) + which = "Coord"; + else if (s->flags & RESTART_DBG_ROOT) + which = "Root"; + else if (s->flags & RESTART_DBG_GHOST) + which = "Ghost"; + else if (s->flags & RESTART_DBG_TASK) + which = "Task"; + else + which = "?????"; + if (s->flags & RESTART_DBG_WAITING) + state = "Waiting"; + else if (s->flags & RESTART_DBG_RUNNING) + state = "Running"; + else if (s->flags & RESTART_DBG_FAILED) + state = "Failed"; + else if (s->flags & RESTART_DBG_SUCCESS) + state = "Success"; + else if (s->flags & 
RESTART_DBG_EXITED) + state = "Exited"; + else + state = "??????"; + ckpt_debug("pid %d type %s state %s\n", s->pid, which, state); + list_del(&s->list); + kfree(s); + } +} + +#else + +static inline int restore_debug_task(struct ckpt_ctx *ctx, int flags) +{ + return 0; +} +static inline void restore_debug_error(struct ckpt_ctx *ctx, int err) {} +static inline void restore_debug_running(struct ckpt_ctx *ctx) {} +static inline void restore_debug_exit(struct ckpt_ctx *ctx) {} + +#endif /* CONFIG_CHECKPOINT_DEBUG */ + + static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h) { char *ptr; @@ -205,11 +371,16 @@ void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type) BUG_ON(!len); h = ckpt_read_obj(ctx, len, len); - if (IS_ERR(h)) + if (IS_ERR(h)) { + ckpt_err(ctx, PTR_ERR(h), "Looking for type %d in ckptfile\n", + type); return h; + } if (h->type != type) { ckpt_hdr_put(ctx, h); + ckpt_err(ctx, -EINVAL, "Next object was type %d, not %d\n", + h->type, type); h = ERR_PTR(-EINVAL); } @@ -449,6 +620,519 @@ static int restore_read_tail(struct ckpt_ctx *ctx) return ret; } +/* restore_read_tree - read the tasks tree into the checkpoint context */ +static int restore_read_tree(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_tree *h; + int size, ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE); + if (IS_ERR(h)) + return PTR_ERR(h); + + ret = -EINVAL; + if (h->nr_tasks <= 0) + goto out; + + ctx->nr_pids = h->nr_tasks; + size = sizeof(*ctx->pids_arr) * ctx->nr_pids; + if (size <= 0) /* overflow ? */ + goto out; + + ctx->pids_arr = kmalloc(size, GFP_KERNEL); + if (!ctx->pids_arr) { + ret = -ENOMEM; + goto out; + } + ret = _ckpt_read_buffer(ctx, ctx->pids_arr, size); + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +static inline int all_tasks_activated(struct ckpt_ctx *ctx) +{ + return (ctx->active_pid == ctx->nr_pids); +} + +static inline pid_t get_active_pid(struct ckpt_ctx *ctx) +{ + int active = ctx->active_pid; + return active >= 0 ? 
ctx->pids_arr[active].vpid : 0; +} + +static inline int is_task_active(struct ckpt_ctx *ctx, pid_t pid) +{ + return get_active_pid(ctx) == pid; +} + +/* + * If exiting a restart with error, then wake up all other tasks + * in the restart context. + */ +void restore_notify_error(struct ckpt_ctx *ctx) +{ + complete(&ctx->complete); + wake_up_all(&ctx->waitq); +} + +static inline struct ckpt_ctx *get_task_ctx(struct task_struct *task) +{ + struct ckpt_ctx *ctx; + + task_lock(task); + ctx = ckpt_ctx_get(task->checkpoint_ctx); + task_unlock(task); + return ctx; +} + +/* returns 0 on success, 1 otherwise */ +static int set_task_ctx(struct task_struct *task, struct ckpt_ctx *ctx) +{ + int ret; + + task_lock(task); + if (!task->checkpoint_ctx) { + task->checkpoint_ctx = ckpt_ctx_get(ctx); + ret = 0; + } else { + ckpt_debug("task %d has checkpoint_ctx\n", task_pid_vnr(task)); + ret = 1; + } + task_unlock(task); + return ret; +} + +static void clear_task_ctx(struct task_struct *task) +{ + struct ckpt_ctx *old; + + task_lock(task); + old = task->checkpoint_ctx; + task->checkpoint_ctx = NULL; + task_unlock(task); + + ckpt_debug("task %d clear checkpoint_ctx\n", task_pid_vnr(task)); + ckpt_ctx_put(old); +} + +static void restore_task_done(struct ckpt_ctx *ctx) +{ + if (atomic_dec_and_test(&ctx->nr_total)) + complete(&ctx->complete); + BUG_ON(atomic_read(&ctx->nr_total) < 0); +} + +static int restore_activate_next(struct ckpt_ctx *ctx) +{ + struct task_struct *task; + pid_t pid; + + ctx->active_pid++; + + BUG_ON(ctx->active_pid > ctx->nr_pids); + + if (!all_tasks_activated(ctx)) { + /* wake up next task in line to restore its state */ + pid = get_active_pid(ctx); + + rcu_read_lock(); + task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns); + /* target task must have same restart context */ + if (task && task->checkpoint_ctx == ctx) + wake_up_process(task); + else + task = NULL; + rcu_read_unlock(); + + if (!task) { + ckpt_err(ctx, -ESRCH, "task %d not found\n", pid); + 
return -ESRCH; + } + } + + return 0; +} + +static int wait_task_active(struct ckpt_ctx *ctx) +{ + pid_t pid = task_pid_vnr(current); + int ret; + + ckpt_debug("pid %d waiting\n", pid); + ret = wait_event_interruptible(ctx->waitq, + is_task_active(ctx, pid) || + ckpt_test_error(ctx)); + ckpt_debug("active %d < %d (ret %d, errno %d)\n", + ctx->active_pid, ctx->nr_pids, ret, ctx->errno); + if (ckpt_test_error(ctx)) + return ckpt_get_error(ctx); + return 0; +} + +static int wait_task_sync(struct ckpt_ctx *ctx) +{ + ckpt_debug("pid %d syncing\n", task_pid_vnr(current)); + wait_event_interruptible(ctx->waitq, ckpt_test_complete(ctx)); + ckpt_debug("task sync done (errno %d)\n", ctx->errno); + if (ckpt_test_error(ctx)) + return ckpt_get_error(ctx); + return 0; +} + +/* grabs a reference to the @ctx on success; caller should free */ +static struct ckpt_ctx *wait_checkpoint_ctx(void) +{ + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq); + struct ckpt_ctx *ctx; + int ret; + + /* + * Wait for coordinator to become visible, then grab a + * reference to its restart context. + */ + ret = wait_event_interruptible(waitq, current->checkpoint_ctx); + if (ret < 0) { + ckpt_debug("wait_checkpoint_ctx: failed (%d)\n", ret); + return ERR_PTR(ret); + } + + ctx = get_task_ctx(current); + if (!ctx) { + ckpt_debug("wait_checkpoint_ctx: checkpoint_ctx missing\n"); + return ERR_PTR(-EAGAIN); + } + + return ctx; +} + +/* + * Ensure that all members of a thread group are in sys_restart before + * restoring any of them. 
Otherwise, restore may modify shared state + * and crash or fault a thread still in userspace, + */ +static int wait_sync_threads(void) +{ + struct task_struct *p = current; + atomic_t *count; + int nr = 0; + int ret = 0; + + if (thread_group_empty(p)) + return 0; + + count = &p->signal->restart_count; + + if (!atomic_read(count)) { + read_lock(&tasklist_lock); + for (p = next_thread(p); p != current; p = next_thread(p)) + nr++; + read_unlock(&tasklist_lock); + /* + * Testing that @count is 0 makes it unlikely that + * multiple threads get here. But if they do, then + * only one will succeed in initializing @count. + */ + atomic_cmpxchg(count, 0, nr + 1); + } + + if (atomic_dec_and_test(count)) { + read_lock(&tasklist_lock); + for (p = next_thread(p); p != current; p = next_thread(p)) + wake_up_process(p); + read_unlock(&tasklist_lock); + } else { + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq); + ret = wait_event_interruptible(waitq, !atomic_read(count)); + } + + return ret; +} + +static int do_restore_task(void) +{ + struct ckpt_ctx *ctx; + int ret; + + ctx = wait_checkpoint_ctx(); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = restore_debug_task(ctx, RESTART_DBG_TASK); + if (ret < 0) + goto out; + + ret = wait_sync_threads(); + if (ret < 0) + goto out; + + /* wait for our turn, do the restore, and tell next task in line */ + ret = wait_task_active(ctx); + if (ret < 0) + goto out; + + restore_debug_running(ctx); + + ret = restore_task(ctx); + if (ret < 0) + goto out; + + restore_task_done(ctx); + ret = wait_task_sync(ctx); + out: + restore_debug_error(ctx, ret); + if (ret < 0) + ckpt_err(ctx, ret, "task restart failed\n"); + + clear_task_ctx(current); + ckpt_ctx_put(ctx); + return ret; +} + +/** + * __prepare_descendants - set ->checkpoint_ctx of a descendants + * @task: descendant task + * @data: points to the checkpoint ctx + */ +static int __prepare_descendants(struct task_struct *task, void *data) +{ + struct ckpt_ctx *ctx = (struct ckpt_ctx *) data; + + 
ckpt_debug("consider task %d\n", task_pid_vnr(task)); + + if (!ptrace_may_access(task, PTRACE_MODE_ATTACH)) { + ckpt_debug("stranger task %d\n", task_pid_vnr(task)); + return -EPERM; + } + + if (task_ptrace(task) & PT_PTRACED) { + ckpt_debug("ptraced task %d\n", task_pid_vnr(task)); + return -EBUSY; + } + + /* + * Set task->checkpoint_ctx of all non-zombie descendants. + * If a descendant already has a ->checkpoint_ctx, it + * must be a coordinator (for a different restart ?) so + * we fail. + * + * Note that own ancestors cannot interfere since they + * won't descend past us, as own ->checkpoint_ctx must + * already be set. + */ + if (!task->exit_state) { + if (set_task_ctx(task, ctx)) + return -EBUSY; + ckpt_debug("prepare task %d\n", task_pid_vnr(task)); + wake_up_process(task); + return 1; + } + + return 0; +} + +/** + * prepare_descendants - set ->checkpoint_ctx of all descendants + * @ctx: checkpoint context + * @root: root process for restart + * + * Called by the coordinator to set the ->checkpoint_ctx pointer of the + * root task and all its descendants. 
+ */ +static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root) +{ + int nr_pids; + + nr_pids = walk_task_subtree(root, __prepare_descendants, ctx); + ckpt_debug("nr %d/%d\n", ctx->nr_pids, nr_pids); + if (nr_pids < 0) + return nr_pids; + + /* fail unless number of processes matches */ + if (nr_pids != ctx->nr_pids) + return -ESRCH; + + atomic_set(&ctx->nr_total, nr_pids); + return nr_pids; +} + +static int wait_all_tasks_finish(struct ckpt_ctx *ctx) +{ + int ret; + + BUG_ON(ctx->active_pid != -1); + ret = restore_activate_next(ctx); + if (ret < 0) + return ret; + + ret = wait_for_completion_interruptible(&ctx->complete); + ckpt_debug("final sync kflags %#lx (ret %d)\n", ctx->kflags, ret); + + return ret; +} + +static struct task_struct *choose_root_task(struct ckpt_ctx *ctx, pid_t pid) +{ + struct task_struct *task; + + if (ctx->uflags & RESTART_TASKSELF) { + ctx->root_pid = pid; + ctx->root_task = current; + get_task_struct(current); + return current; + } + + read_lock(&tasklist_lock); + list_for_each_entry(task, &current->children, sibling) { + if (task_pid_vnr(task) == pid) { + get_task_struct(task); + ctx->root_task = task; + ctx->root_pid = pid; + break; + } + } + read_unlock(&tasklist_lock); + + return ctx->root_task; +} + +/* setup restart-specific parts of ctx */ +static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid) +{ + struct nsproxy *nsproxy; + + /* + * No need for explicit cleanup here, because if an error + * occurs then ckpt_ctx_free() is eventually called. 
+ */ + + if (!choose_root_task(ctx, pid)) + return -ESRCH; + + rcu_read_lock(); + nsproxy = task_nsproxy(ctx->root_task); + if (nsproxy) { + get_nsproxy(nsproxy); + ctx->root_nsproxy = nsproxy; + } + rcu_read_unlock(); + if (!nsproxy) + return -ESRCH; + + ctx->active_pid = -1; /* see restore_activate_next, get_active_pid */ + + return 0; +} + +static int __destroy_descendants(struct task_struct *task, void *data) +{ + struct ckpt_ctx *ctx = (struct ckpt_ctx *) data; + + if (task->checkpoint_ctx == ctx) + force_sig(SIGKILL, task); + + return 0; +} + +static void destroy_descendants(struct ckpt_ctx *ctx) +{ + walk_task_subtree(ctx->root_task, __destroy_descendants, ctx); +} + +static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid) +{ + int ret; + + ret = restore_debug_task(ctx, RESTART_DBG_COORD); + if (ret < 0) + return ret; + restore_debug_running(ctx); + + ret = restore_read_header(ctx); + ckpt_debug("restore header: %d\n", ret); + if (ret < 0) + return ret; + ret = restore_container(ctx); + ckpt_debug("restore container: %d\n", ret); + if (ret < 0) + return ret; + ret = restore_read_tree(ctx); + ckpt_debug("restore tree: %d\n", ret); + if (ret < 0) + return ret; + + if ((ctx->uflags & RESTART_TASKSELF) && ctx->nr_pids != 1) + return -EINVAL; + + ret = init_restart_ctx(ctx, pid); + if (ret < 0) + return ret; + + /* + * Populate own ->checkpoint_ctx: if an ancestor attempts to + * prepare_descendants() on us, it will fail. Furthermore, + * that ancestor won't proceed deeper to interfere with our + * descendants that are restarting. + */ + if (set_task_ctx(current, ctx)) { + /* + * We are a bad-behaving descendant: an ancestor must + * have prepare_descendants() us as part of a restart. + */ + ckpt_debug("coord already has checkpoint_ctx\n"); + return -EBUSY; + } + + /* + * From now on we are committed to the restart. If anything + * fails, we'll cleanup (that is, kill) those tasks in our + * subtree that we marked for restart - see below. 
+ */ + + if (ctx->uflags & RESTART_TASKSELF) { + ret = restore_task(ctx); + ckpt_debug("restore task: %d\n", ret); + if (ret < 0) + goto out; + } else { + /* prepare descendants' t->checkpoint_ctx point to coord */ + ret = prepare_descendants(ctx, ctx->root_task); + ckpt_debug("restore prepare: %d\n", ret); + if (ret < 0) + goto out; + /* wait for all other tasks to complete do_restore_task() */ + ret = wait_all_tasks_finish(ctx); + ckpt_debug("restore finish: %d\n", ret); + if (ret < 0) + goto out; + } + + ret = restore_read_tail(ctx); + ckpt_debug("restore tail: %d\n", ret); + if (ret < 0) + goto out; + + if (ctx->uflags & RESTART_FROZEN) { + ret = cgroup_freezer_make_frozen(ctx->root_task); + ckpt_debug("freezing restart tasks ... %d\n", ret); + } + out: + restore_debug_error(ctx, ret); + if (ret < 0) + ckpt_err(ctx, ret, "restart failed (coordinator)\n"); + + if (ckpt_test_error(ctx)) { + destroy_descendants(ctx); + ret = ckpt_get_error(ctx); + } else { + ckpt_set_success(ctx); + wake_up_all(&ctx->waitq); + } + + clear_task_ctx(current); + return ret; +} + static long restore_retval(void) { struct pt_regs *regs = task_pt_regs(current); @@ -499,31 +1183,62 @@ static long restore_retval(void) return syscall_get_return_value(current, regs); } -/* setup restart-specific parts of ctx */ -static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid) +long do_restart(struct ckpt_ctx *ctx, pid_t pid) { - return 0; + long ret; + + if (ctx) + ret = do_restore_coord(ctx, pid); + else + ret = do_restore_task(); + + /* restart(2) isn't idempotent: should not be auto-restarted */ + if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR || + ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK) + ret = -EINTR; + + /* + * The retval from what we return to the caller when all goes + * well: this is either the retval from the original syscall + * that was interrupted during checkpoint, or the contents of + * (saved) eax if the task was in userspace. 
+ * + * The coordinator (ctx!=NULL) is exempt: don't adjust its retval. + * But in self-restart (where RESTART_TASKSELF), the coordinator + * _itself_ is a restarting task. + */ + + if (!ctx || (ctx->uflags & RESTART_TASKSELF)) { + if (ret < 0) { + /* partial restore is undefined: terminate */ + ckpt_debug("restart err %ld, exiting\n", ret); + force_sig(SIGKILL, current); + } else { + ret = restore_retval(); + } + } + + ckpt_debug("sys_restart returns %ld\n", ret); + return ret; } -long do_restart(struct ckpt_ctx *ctx, pid_t pid) +/** + * exit_checkpoint - callback from do_exit to cleanup checkpoint state + * @tsk: terminating task + */ +void exit_checkpoint(struct task_struct *tsk) { - long ret; + struct ckpt_ctx *ctx; - ret = init_restart_ctx(ctx, pid); - if (ret < 0) - return ret; - ret = restore_read_header(ctx); - if (ret < 0) - return ret; - ret = restore_container(ctx); - if (ret < 0) - return ret; - ret = restore_task(ctx); - if (ret < 0) - return ret; - ret = restore_read_tail(ctx); - if (ret < 0) - return ret; + /* no one else will touch this, because @tsk is dead already */ + ctx = tsk->checkpoint_ctx; + + /* restarting zombies will activate next task in restart */ + if (tsk->flags & PF_RESTARTING) { + BUG_ON(ctx->active_pid == -1); + if (restore_activate_next(ctx) < 0) + pr_warning("c/r: [%d] failed zombie exit\n", tsk->pid); + } - return restore_retval(); + ckpt_ctx_put(ctx); } diff --git a/checkpoint/sys.c b/checkpoint/sys.c index d0eed25..8b142ed 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -66,6 +66,9 @@ int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count) mm_segment_t fs; int ret; + if (ckpt_test_error(ctx)) + return ckpt_get_error(ctx); + fs = get_fs(); set_fs(KERNEL_DS); ret = _ckpt_kwrite(ctx->file, addr, count); @@ -103,6 +106,9 @@ int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count) mm_segment_t fs; int ret; + if (ckpt_test_error(ctx)) + return ckpt_get_error(ctx); + fs = get_fs(); set_fs(KERNEL_DS); ret = 
_ckpt_kread(ctx->file , addr, count); @@ -194,6 +200,12 @@ static void task_arr_free(struct ckpt_ctx *ctx) static void ckpt_ctx_free(struct ckpt_ctx *ctx) { + BUG_ON(atomic_read(&ctx->refcount)); + + /* per task status debugging only during restart */ + if (ctx->kflags & CKPT_CTX_RESTART) + restore_debug_free(ctx); + if (ctx->file) fput(ctx->file); if (ctx->logfile) @@ -209,6 +221,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx) if (ctx->root_freezer) put_task_struct(ctx->root_freezer); + kfree(ctx->pids_arr); + kfree(ctx); } @@ -226,6 +240,17 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, ctx->kflags = kflags; ctx->ktime_begin = ktime_get(); + atomic_set(&ctx->refcount, 0); + init_waitqueue_head(&ctx->waitq); + init_completion(&ctx->complete); + + init_completion(&ctx->errno_sync); + +#ifdef CONFIG_CHECKPOINT_DEBUG + INIT_LIST_HEAD(&ctx->task_status); + spin_lock_init(&ctx->lock); +#endif + mutex_init(&ctx->msg_mutex); err = -EBADF; @@ -238,16 +263,43 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, if (!ctx->logfile) goto err; nolog: + atomic_inc(&ctx->refcount); return ctx; err: ckpt_ctx_free(ctx); return ERR_PTR(err); } -static void ckpt_set_error(struct ckpt_ctx *ctx, int err) +struct ckpt_ctx *ckpt_ctx_get(struct ckpt_ctx *ctx) +{ + if (ctx) + atomic_inc(&ctx->refcount); + return ctx; +} + +void ckpt_ctx_put(struct ckpt_ctx *ctx) +{ + if (ctx && atomic_dec_and_test(&ctx->refcount)) + ckpt_ctx_free(ctx); +} + +void ckpt_set_error(struct ckpt_ctx *ctx, int err) { - if (!ckpt_test_and_set_ctx_kflag(ctx, CKPT_CTX_ERROR)) + /* atomically set ctx->errno */ + if (!ckpt_test_and_set_ctx_kflag(ctx, CKPT_CTX_ERROR)) { ctx->errno = err; + /* make ctx->errno visible to all other tasks */ + complete_all(&ctx->errno_sync); + /* on restart, notify all tasks in restarting subtree */ + if (ctx->kflags & CKPT_CTX_RESTART) + restore_notify_error(ctx); + } +} + +void ckpt_set_success(struct ckpt_ctx *ctx) +{ + 
ckpt_set_ctx_kflag(ctx, CKPT_CTX_SUCCESS); + complete_all(&ctx->errno_sync); } /* helpers to handler log/dbg/err messages */ @@ -392,7 +444,7 @@ void _ckpt_msg_complete(struct ckpt_ctx *ctx) if (ctx->msglen <= 1) return; - if (ctx->kflags & CKPT_CTX_CHECKPOINT && ctx->errno) { + if (ctx->kflags & CKPT_CTX_CHECKPOINT && ckpt_test_error(ctx)) { ret = ckpt_write_obj_type(ctx, NULL, 0, CKPT_HDR_ERROR); if (!ret) ret = ckpt_write_string(ctx, ctx->msg, ctx->msglen); @@ -550,7 +602,7 @@ long do_sys_checkpoint(pid_t pid, int fd, unsigned long flags, int logfd) if (!ret) ret = ctx->crid; - ckpt_ctx_free(ctx); + ckpt_ctx_put(ctx); return ret; } @@ -570,24 +622,20 @@ long do_sys_restart(pid_t pid, int fd, unsigned long flags, int logfd) long ret; /* no flags for now */ - if (flags) + if (flags & ~RESTART_USER_FLAGS) return -EINVAL; if (ckpt_unpriv_allowed < 2 && !capable(CAP_SYS_ADMIN)) return -EPERM; - ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART, logfd); + if (pid) + ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART, logfd); if (IS_ERR(ctx)) return PTR_ERR(ctx); ret = do_restart(ctx, pid); - /* restart(2) isn't idempotent: can't restart syscall */ - if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR || - ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK) - ret = -EINTR; - - ckpt_ctx_free(ctx); + ckpt_ctx_put(ctx); return ret; } diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 30f5353..d1eb722 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -15,6 +15,10 @@ /* checkpoint user flags */ #define CHECKPOINT_SUBTREE 0x1 +/* restart user flags */ +#define RESTART_TASKSELF 0x1 +#define RESTART_FROZEN 0x2 + /* misc user visible */ #define CHECKPOINT_FD_NONE -1 @@ -34,17 +38,22 @@ extern long do_sys_restart(pid_t pid, int fd, /* ckpt_ctx: kflags */ #define CKPT_CTX_CHECKPOINT_BIT 0 #define CKPT_CTX_RESTART_BIT 1 +#define CKPT_CTX_SUCCESS_BIT 2 +#define CKPT_CTX_ERROR_BIT 3 #define CKPT_CTX_CHECKPOINT (1 << 
CKPT_CTX_CHECKPOINT_BIT) #define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT) +#define CKPT_CTX_SUCCESS (1 << CKPT_CTX_SUCCESS_BIT) +#define CKPT_CTX_ERROR (1 << CKPT_CTX_ERROR_BIT) /* ckpt_ctx: uflags */ #define CHECKPOINT_USER_FLAGS CHECKPOINT_SUBTREE - +#define RESTART_USER_FLAGS (RESTART_TASKSELF | RESTART_FROZEN) extern int walk_task_subtree(struct task_struct *task, int (*func)(struct task_struct *, void *), void *data); +extern void exit_checkpoint(struct task_struct *tsk); extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count); extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count); @@ -71,6 +80,35 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx, extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max); extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type); +/* ckpt kflags */ +#define ckpt_set_ctx_kflag(__ctx, __kflag) \ + set_bit(__kflag##_BIT, &(__ctx)->kflags) +#define ckpt_test_and_set_ctx_kflag(__ctx, __kflag) \ + test_and_set_bit(__kflag##_BIT, &(__ctx)->kflags) + +#define ckpt_test_error(ctx) \ + ((ctx)->kflags & CKPT_CTX_ERROR) +#define ckpt_test_complete(ctx) \ + ((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR)) + +extern void ckpt_set_success(struct ckpt_ctx *ctx); +extern void ckpt_set_error(struct ckpt_ctx *ctx, int err); + +static inline int ckpt_get_error(struct ckpt_ctx *ctx) +{ + /* + * We may notice CKPT_CTX_ERROR before ctx->errno is set, but + * ctx->errno_sync remains not-completed until after it's done. 
+ */ + wait_for_completion(&ctx->errno_sync); + return ctx->errno; +} + +extern void restore_notify_error(struct ckpt_ctx *ctx); + +extern struct ckpt_ctx *ckpt_ctx_get(struct ckpt_ctx *ctx); +extern void ckpt_ctx_put(struct ckpt_ctx *ctx); + extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid); extern long do_restart(struct ckpt_ctx *ctx, pid_t pid); @@ -108,6 +146,8 @@ static inline int ckpt_validate_errno(int errno) #endif #ifdef CONFIG_CHECKPOINT_DEBUG + +extern void restore_debug_free(struct ckpt_ctx *ctx); extern unsigned long ckpt_debug_level; /* @@ -133,6 +173,8 @@ extern unsigned long ckpt_debug_level; #else +static inline void restore_debug_free(struct ckpt_ctx *ctx) {} + /* * This is deprecated */ diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h index a66c603..afe76ad 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -16,6 +16,7 @@ #include <linux/nsproxy.h> #include <linux/fs.h> #include <linux/ktime.h> +#include <linux/wait.h> struct ckpt_ctx { int crid; /* unique checkpoint id */ @@ -36,17 +37,36 @@ struct ckpt_ctx { struct file *logfile; /* status/debug log file */ loff_t total; /* total read/written */ - struct task_struct **tasks_arr; /* array of all tasks in container */ - int nr_tasks; /* size of tasks array */ + atomic_t refcount; struct task_struct *tsk;/* checkpoint: current target task */ char err_string[256]; /* checkpoint: error string */ + int errno; /* errno that caused failure */ + struct completion errno_sync; /* protect errno setting */ + + /* [multi-process checkpoint] */ + struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */ + int nr_tasks; /* size of tasks array */ + + /* [multi-process restart] */ + struct ckpt_pids *pids_arr; /* array of all pids [restart] */ + int nr_pids; /* size of pids array */ + atomic_t nr_total; /* total tasks count */ + int active_pid; /* (next) position in pids array */ + struct completion complete; /* container 
root and other tasks on */ + wait_queue_head_t waitq; /* start, end, and restart ordering */ + #define CKPT_MSG_LEN 1024 char fmt[CKPT_MSG_LEN]; char msg[CKPT_MSG_LEN]; int msglen; struct mutex msg_mutex; + +#ifdef CONFIG_CHECKPOINT_DEBUG + struct list_head task_status; /* list of status for each task */ + spinlock_t lock; +#endif }; #endif /* __KERNEL__ */ diff --git a/include/linux/sched.h b/include/linux/sched.h index bcc44ad..a70d7d1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -691,6 +691,10 @@ struct signal_struct { #endif int oom_adj; /* OOM kill score adjustment (bit shift) */ + +#ifdef CONFIG_CHECKPOINT + atomic_t restart_count; /* threads group restart sync */ +#endif }; /* Context switch must be unlocked if interrupts are to be enabled */ @@ -1578,6 +1582,9 @@ struct task_struct { unsigned long memsw_bytes; /* uncharged mem+swap usage */ } memcg_batch; #endif +#ifdef CONFIG_CHECKPOINT + struct ckpt_ctx *checkpoint_ctx; +#endif }; /* Future-safe accessor for struct task_struct's cpus_allowed. */ @@ -1771,6 +1778,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t * #define PF_EXITING 0x00000004 /* getting shut down */ #define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ +#define PF_RESTARTING 0x00000020 /* Process is restarting (c/r) */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ #define PF_MCE_PROCESS 0x00000080 /* process policy on mce errors */ #define PF_SUPERPRIV 0x00000100 /* used super-user privileges */ @@ -2272,7 +2280,7 @@ static inline int task_detached(struct task_struct *p) * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring * subscriptions and synchronises with wait4(). Also used in procfs. Also * pins the final release of task.io_context. Also protects ->cpuset and - * ->cgroup.subsys[]. + * ->cgroup.subsys[]. Also protects ->checkpoint_ctx in checkpoint/restart. 
* * Nests both inside and outside of read_lock(&tasklist_lock). * It must not be nested with write_lock_irq(&tasklist_lock), diff --git a/kernel/exit.c b/kernel/exit.c index 546774a..f8eb8bb 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -50,6 +50,7 @@ #include <linux/perf_event.h> #include <trace/events/sched.h> #include <linux/hw_breakpoint.h> +#include <linux/checkpoint.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -1000,6 +1001,10 @@ NORET_TYPE void do_exit(long code) if (unlikely(current->pi_state_cache)) kfree(current->pi_state_cache); #endif +#ifdef CONFIG_CHECKPOINT + if (unlikely(tsk->checkpoint_ctx)) + exit_checkpoint(tsk); +#endif /* * Make sure we are holding no locks: */ diff --git a/kernel/fork.c b/kernel/fork.c index 0f202ae..4eb8e7e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -65,6 +65,7 @@ #include <linux/perf_event.h> #include <linux/posix-timers.h> #include <linux/user-return-notifier.h> +#include <linux/checkpoint.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -1246,6 +1247,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, /* Need tasklist lock for parent etc handling! */ write_lock_irq(&tasklist_lock); +#ifdef CONFIG_CHECKPOINT + /* If parent is restarting, child should be too */ + if (unlikely(current->checkpoint_ctx)) + p->checkpoint_ctx = ckpt_ctx_get(current->checkpoint_ctx); +#endif + /* CLONE_PARENT re-uses the old parent */ if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) { p->real_parent = current->real_parent; -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
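A note on the reference counting and error handling in the patch above: ckpt_ctx_alloc() takes the first reference, ckpt_ctx_get()/ckpt_ctx_put() adjust the count, the last put frees the context, and ckpt_set_error() records only the first error. The following is a minimal userspace sketch of that pattern, not the kernel code. All demo_* names and the `freed` test hook are hypothetical, C11 atomics stand in for the kernel's atomic_t and test_and_set_bit(), and the errno_sync completion that the patch uses to publish ctx->errno to other tasks is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct ckpt_ctx (not the kernel type). */
struct demo_ctx {
	atomic_int refcount;	/* number of holders */
	atomic_flag error_set;	/* set once, by the first failure */
	int err;		/* first error that was reported */
	int *freed;		/* test hook: set to 1 when the ctx is released */
};

/* Allocate a context holding one reference, as ckpt_ctx_alloc() does. */
struct demo_ctx *demo_ctx_alloc(int *freed)
{
	struct demo_ctx *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return NULL;
	atomic_init(&ctx->refcount, 1);
	atomic_flag_clear(&ctx->error_set);
	ctx->freed = freed;
	return ctx;
}

/* Take an extra reference; NULL is passed through unchanged. */
struct demo_ctx *demo_ctx_get(struct demo_ctx *ctx)
{
	if (ctx)
		atomic_fetch_add(&ctx->refcount, 1);
	return ctx;
}

/* Drop a reference; whoever drops the last one frees the context. */
void demo_ctx_put(struct demo_ctx *ctx)
{
	int old;

	if (!ctx)
		return;
	old = atomic_fetch_sub(&ctx->refcount, 1);
	assert(old > 0);	/* going below zero means an unbalanced put */
	if (old == 1) {
		if (ctx->freed)
			*ctx->freed = 1;
		free(ctx);
	}
}

/* Only the first caller records its error; later errors are ignored. */
void demo_set_error(struct demo_ctx *ctx, int err)
{
	if (!atomic_flag_test_and_set(&ctx->error_set))
		ctx->err = err;
}
```

In this sketch a child that inherits the context would call demo_ctx_get() and later demo_ctx_put(), mirroring what copy_process() and exit_checkpoint() do with current->checkpoint_ctx in the patch.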
* [C/R v20][PATCH 31/96] c/r: introduce PF_RESTARTING, and skip notification on exit
2010-03-17 16:08 ` [C/R v20][PATCH 30/96] c/r: restart " Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 32/96] c/r: support for zombie processes Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

To restore zombies we will create a task that, on its turn to
run, calls do_exit(). Unlike normal tasks that exit, we need to
prevent notification side effects that send signals to other
processes, e.g. parent (SIGCHLD) or child tasks (per child's request).

There are three main cases for such notifications:

1) do_notify_parent(): parent of a process is notified about a change
in status (e.g. become zombie, reparent, etc). If parent ignores,
then mark child for immediate release (skip zombie).

2) kill_orphaned_pgrp(): a process group that becomes orphaned will
signal stopped jobs (HUP then CONT).

3) forget_original_parent(): children of a process are signaled (per
request) with p->pdeath_signal

Remember that restoring signal state (for any restarting task) must
complete _before_ it is allowed to resume execution, and not during
the resume. Otherwise, a running task may send a signal to another
task that hasn't restored yet, so the new signal will be lost
soon-after.

I considered two possible ways to address this:

1. Add another sync point to restart: all tasks will first restore
their state without signals (all signals blocked), and zombies call
do_exit(). A sync point then will ensure that all zombies are gone and
their effects done. Then all tasks restore their signal state (and
mask), and sync (new point) again. Only then they may resume
execution.
The main disadvantage is the added complexity and inefficiency,
for no good reason.

2. Introduce PF_RESTARTING: mark all restarting tasks with a new flag,
and teach the above three notifications to skip sending the signal if
this flag is set.
The main advantage is simplicity and completeness. Also, such a flag
may be useful later on. This is the method implemented.

Changelog [ckpt-v19-rc3]:
  - Rebase to kernel 2.6.33
Changelog [ckpt-v19-rc1]:
  - In reparent_thread() test for PF_RESTARTING on parent

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 kernel/exit.c   |    6 +++++-
 kernel/signal.c |    4 ++++
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f8eb8bb..576576a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -300,6 +300,10 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 	struct pid *pgrp = task_pgrp(tsk);
 	struct task_struct *ignored_task = tsk;
 
+	/* restarting zombie doesn't trigger signals */
+	if (tsk->flags & PF_RESTARTING)
+		return;
+
 	if (!parent)
 		 /* exit: our father is in a different pgrp than
 		  * we are and we were the only connection outside.
@@ -785,7 +789,7 @@ static void forget_original_parent(struct task_struct *father)
 				BUG_ON(task_ptrace(t));
 				t->parent = t->real_parent;
 			}
-			if (t->pdeath_signal)
+			if (t->pdeath_signal && !(t->flags & PF_RESTARTING))
 				group_send_sig_info(t->pdeath_signal,
 						    SEND_SIG_NOINFO, t);
 		} while_each_thread(p, t);
diff --git a/kernel/signal.c b/kernel/signal.c
index 934ae5e..ce8d404 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1432,6 +1432,10 @@ int do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(!task_ptrace(tsk) &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
+	/* restarting zombie doesn't notify parent */
+	if (tsk->flags & PF_RESTARTING)
+		return ret;
+
 	info.si_signo = sig;
 	info.si_errno = 0;
 	/*
-- 
1.6.3.3
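To make the chosen approach (option 2) concrete, here is a small userspace sketch of a notification path that checks a per-task flag and returns early without sending anything. It only models the control flow of the patch: struct demo_task, the signals_received counter, and the demo_* functions are hypothetical stand-ins, and nothing here delivers real signals. The flag value is the PF_RESTARTING bit that this series defines in sched.h.

```c
#include <assert.h>

/* Same bit value that the series assigns to PF_RESTARTING. */
#define DEMO_PF_RESTARTING 0x00000020

/* Hypothetical, heavily reduced stand-in for struct task_struct. */
struct demo_task {
	unsigned int flags;
	int pdeath_signal;	/* signal requested on parent death, 0 if none */
	int signals_received;	/* counter standing in for signal delivery */
};

/*
 * Model of do_notify_parent(): an exiting task normally signals its
 * parent, but a restarting zombie stays silent.
 * Returns 1 if a notification was "sent", 0 if it was skipped.
 */
int demo_notify_parent(const struct demo_task *tsk, struct demo_task *parent)
{
	if (tsk->flags & DEMO_PF_RESTARTING)
		return 0;
	parent->signals_received++;
	return 1;
}

/*
 * Model of the pdeath_signal case in forget_original_parent(): here the
 * flag is tested on the child that would receive the signal.
 */
int demo_send_pdeath(struct demo_task *child)
{
	if (child->pdeath_signal && !(child->flags & DEMO_PF_RESTARTING)) {
		child->signals_received++;
		return 1;
	}
	return 0;
}
```

Note that the two helpers test the flag on different tasks, as the patch does: the exiting task in do_notify_parent() and kill_orphaned_pgrp(), the receiving child in forget_original_parent().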
* [C/R v20][PATCH 32/96] c/r: support for zombie processes
2010-03-17 16:08 ` [C/R v20][PATCH 31/96] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 33/96] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

During checkpoint, a zombie process needs only to save p->comm,
p->state, p->exit_state, and p->exit_code.

During restart, zombie processes are created like all other
processes. They validate the saved exit_code, then restore p->comm
and p->exit_code. Then they call do_exit() instead of waking up the
next task in line.

But before that, they place the @ctx in p->checkpoint_ctx, so that
only at exit time they will wake up the next task in line, and drop
the reference to the @ctx.

This provides the guarantee that when the coordinator's wait
completes, all normal tasks completed their restart, and all zombie
tasks are already zombified (as opposed to perhaps only becoming
a zombie).

Changelog[v19-rc1]:
  - Simplify logic of tracking restarting tasks
Changelog[v18]:
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Add a few more ckpt_write_err()s
Changelog[v17]:
  - Validate t->exit_signal for both threads and leader
  - Skip zombies in most of may_checkpoint_task()
  - Save/restore t->pdeath_signal
  - Validate ->exit_signal and ->pdeath_signal

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E.
Hallyn <serue@us.ibm.com> --- checkpoint/checkpoint.c | 10 ++++-- checkpoint/process.c | 69 +++++++++++++++++++++++++++++++++++----- checkpoint/restart.c | 22 +++++++++++-- include/linux/checkpoint.h | 1 + include/linux/checkpoint_hdr.h | 1 + 5 files changed, 89 insertions(+), 14 deletions(-) diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 1e38ae3..ea1494d 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -218,7 +218,7 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns)); - if (t->state == TASK_DEAD) { + if (t->exit_state == EXIT_DEAD) { _ckpt_err(ctx, -EBUSY, "%(T)Task state EXIT_DEAD\n"); return -EBUSY; } @@ -228,6 +228,10 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) return -EPERM; } + /* zombies are cool (and also don't have nsproxy, below...) */ + if (t->exit_state) + return 0; + /* verify that all tasks belongs to same freezer cgroup */ if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) { _ckpt_err(ctx, -EBUSY, "%(T)Not frozen or wrong cgroup\n"); @@ -244,8 +248,8 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) * FIX: for now, disallow siblings of container init created * via CLONE_PARENT (unclear if they will remain possible) */ - if (ctx->root_init && t != root && t->tgid != root->tgid && - t->real_parent == root->real_parent) { + if (ctx->root_init && t != root && + t->real_parent == root->real_parent && t->tgid != root->tgid) { _ckpt_err(ctx, -EINVAL, "%(T)Task is sibling of root\n"); return -EINVAL; } diff --git a/checkpoint/process.c b/checkpoint/process.c index 9f2059c..c47dea1 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -35,12 +35,18 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) h->state = t->state; h->exit_state = t->exit_state; h->exit_code = t->exit_code; - h->exit_signal = 
t->exit_signal; - h->set_child_tid = (unsigned long) t->set_child_tid; - h->clear_child_tid = (unsigned long) t->clear_child_tid; + if (t->exit_state) { + /* zombie - skip remaining state */ + BUG_ON(t->exit_state != EXIT_ZOMBIE); + } else { + /* FIXME: save remaining relevant task_struct fields */ + h->exit_signal = t->exit_signal; + h->pdeath_signal = t->pdeath_signal; - /* FIXME: save remaining relevant task_struct fields */ + h->set_child_tid = (unsigned long) t->set_child_tid; + h->clear_child_tid = (unsigned long) t->clear_child_tid; + } ret = ckpt_write_obj(ctx, &h->h); ckpt_hdr_put(ctx, h); @@ -171,6 +177,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) ckpt_debug("task %d\n", ret); if (ret < 0) goto out; + + /* zombie - we're done here */ + if (t->exit_state) + return 0; + ret = checkpoint_thread(ctx, t); ckpt_debug("thread %d\n", ret); if (ret < 0) @@ -190,6 +201,19 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) * Restart */ +static inline int valid_exit_code(int exit_code) +{ + if (exit_code >= 0x10000) + return 0; + if (exit_code & 0xff) { + if (exit_code & ~0xff) + return 0; + if (!valid_signal(exit_code & 0xff)) + return 0; + } + return 1; +} + /* read the task_struct into the current task */ static int restore_task_struct(struct ckpt_ctx *ctx) { @@ -201,15 +225,39 @@ static int restore_task_struct(struct ckpt_ctx *ctx) if (IS_ERR(h)) return PTR_ERR(h); + ret = -EINVAL; + if (h->state == TASK_DEAD) { + if (h->exit_state != EXIT_ZOMBIE) + goto out; + if (!valid_exit_code(h->exit_code)) + goto out; + t->exit_code = h->exit_code; + } else { + if (h->exit_code) + goto out; + if ((thread_group_leader(t) && !valid_signal(h->exit_signal)) || + (!thread_group_leader(t) && h->exit_signal != -1)) + goto out; + if (!valid_signal(h->pdeath_signal)) + goto out; + + /* FIXME: restore remaining relevant task_struct fields */ + t->exit_signal = h->exit_signal; + t->pdeath_signal = h->pdeath_signal; + + t->set_child_tid = + 
(int __user *) (unsigned long) h->set_child_tid; + t->clear_child_tid = + (int __user *) (unsigned long) h->clear_child_tid; + } + memset(t->comm, 0, TASK_COMM_LEN); ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN); if (ret < 0) goto out; - t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid; - t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid; - - /* FIXME: restore remaining relevant task_struct fields */ + /* return 1 for zombie, 0 otherwise */ + ret = (h->state == TASK_DEAD ? 1 : 0); out: ckpt_hdr_put(ctx, h); return ret; @@ -329,6 +377,11 @@ int restore_task(struct ckpt_ctx *ctx) ckpt_debug("task %d\n", ret); if (ret < 0) goto out; + + /* zombie - we're done here */ + if (ret) + goto out; + ret = restore_thread(ctx); ckpt_debug("thread %d\n", ret); if (ret < 0) diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 59c4bd8..e2ed358 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -852,7 +852,7 @@ static int wait_sync_threads(void) static int do_restore_task(void) { struct ckpt_ctx *ctx; - int ret; + int zombie, ret; ctx = wait_checkpoint_ctx(); if (IS_ERR(ctx)) @@ -862,6 +862,8 @@ static int do_restore_task(void) if (ret < 0) goto out; + current->flags |= PF_RESTARTING; + ret = wait_sync_threads(); if (ret < 0) goto out; @@ -873,9 +875,22 @@ static int do_restore_task(void) restore_debug_running(ctx); - ret = restore_task(ctx); - if (ret < 0) + zombie = restore_task(ctx); + if (zombie < 0) { + ret = zombie; goto out; + } + + /* + * zombie: we're done here; do_exit() will notice the @ctx on + * our current->checkpoint_ctx (and our PF_RESTARTING) - it + * will call restore_activate_next() and release the @ctx. 
+ */ + if (zombie) { + restore_debug_exit(ctx); + ckpt_ctx_put(ctx); + do_exit(current->exit_code); + } restore_task_done(ctx); ret = wait_task_sync(ctx); @@ -884,6 +899,7 @@ static int do_restore_task(void) if (ret < 0) ckpt_err(ctx, ret, "task restart failed\n"); + current->flags &= ~PF_RESTARTING; clear_task_ctx(current); ckpt_ctx_put(ctx); return ret; diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index d1eb722..61581f6 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -113,6 +113,7 @@ extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid); extern long do_restart(struct ckpt_ctx *ctx, pid_t pid); /* task */ +extern int ckpt_activate_next(struct ckpt_ctx *ctx); extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t); extern int restore_task(struct ckpt_ctx *ctx); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 083f5d3..f85b673 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -160,6 +160,7 @@ struct ckpt_hdr_task { __u32 exit_state; __u32 exit_code; __u32 exit_signal; + __u32 pdeath_signal; __u64 set_child_tid; __u64 clear_child_tid; -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
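For readers who want to try the exit-code validation from the patch above outside the kernel, the following is a userspace copy of the logic in valid_exit_code(). The function body follows the patch; demo_valid_signal() and DEMO_NSIG are assumptions that stand in for the kernel's valid_signal() and _NSIG (64 is the x86 value), and the demo_ prefix marks everything here as illustrative.

```c
#include <assert.h>

#define DEMO_NSIG 64	/* assumption: _NSIG as on x86 Linux */

/* Assumed equivalent of the kernel's valid_signal(). */
int demo_valid_signal(unsigned long sig)
{
	return sig <= DEMO_NSIG ? 1 : 0;
}

/*
 * Same checks as valid_exit_code() in the patch:
 *  - nothing at or above bit 16 may be set;
 *  - if the low byte is non-zero (death by signal), no higher bits
 *    may be set and the low byte must be a valid signal number.
 */
int demo_valid_exit_code(int exit_code)
{
	if (exit_code >= 0x10000)
		return 0;
	if (exit_code & 0xff) {
		if (exit_code & ~0xff)
			return 0;
		if (!demo_valid_signal(exit_code & 0xff))
			return 0;
	}
	return 1;
}
```

With these assumptions, a normal exit such as exit(3) (stored as 3 << 8) and death by signal 9 both pass, while a value that mixes a signal number with an exit status is rejected. A low byte with the core-dump bit 0x80 set also appears to be rejected, because the whole low byte is handed to the signal check.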
* [C/R v20][PATCH 33/96] c/r: Save and restore the [compat_]robust_list member of the task struct
2010-03-17 16:08 ` [C/R v20][PATCH 32/96] c/r: support for zombie processes Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 34/96] c/r: infrastructure for shared objects Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Matt Helsley

From: Matt Helsley <matthltc@us.ibm.com>

These lists record which futexes the task holds. To keep the overhead
of robust futexes low, the list is kept in userspace. When the task
exits, the kernel carefully walks these lists to recover held futexes
that other tasks may be attempting to acquire with FUTEX_WAIT.

Because they point to userspace memory that is saved/restored by
checkpoint/restart, saving the list pointers themselves is safe.

While saving the pointers is safe during checkpoint, restart is tricky
because the robust futex ABI contains provisions for changes based on
checking the size of the list head. So we need to save the length of
the list head too, in order to make sure that the kernel used during
restart is capable of handling that ABI. Since there is only one ABI
supported at the moment, taking the list head's size is simple. Should
the ABI change, we will need to use the same size as specified during
sys_set_robust_list(), and hence some new means of determining the
length of this userspace structure in sys_checkpoint would be required.

Rather than rewrite the logic that checks and handles the ABI, we reuse
sys_set_robust_list() by factoring out the body of the function and
calling it during restart.

Changelog [v19]:
  - Keep __u32s in even groups for 32-64 bit compatibility

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
[orenl@cs.columbia.edu: move save/restore code to checkpoint/process.c]
Acked-by: Serge E.
Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- checkpoint/process.c | 49 ++++++++++++++++++++++++++++++++++++++++ include/linux/checkpoint_hdr.h | 5 ++++ include/linux/compat.h | 3 +- include/linux/futex.h | 1 + kernel/futex.c | 19 +++++++++----- kernel/futex_compat.c | 13 ++++++++-- 6 files changed, 79 insertions(+), 11 deletions(-) diff --git a/checkpoint/process.c b/checkpoint/process.c index c47dea1..f36e320 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -14,10 +14,57 @@ #include <linux/sched.h> #include <linux/posix-timers.h> #include <linux/futex.h> +#include <linux/compat.h> #include <linux/poll.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> + +#ifdef CONFIG_FUTEX +static void save_task_robust_futex_list(struct ckpt_hdr_task *h, + struct task_struct *t) +{ + /* + * These are __user pointers and thus can be saved without + * the objhash. + */ + h->robust_futex_list = (unsigned long)t->robust_list; + h->robust_futex_head_len = sizeof(*t->robust_list); +#ifdef CONFIG_COMPAT + h->compat_robust_futex_list = ptr_to_compat(t->compat_robust_list); + h->compat_robust_futex_head_len = sizeof(*t->compat_robust_list); +#endif +} + +static void restore_task_robust_futex_list(struct ckpt_hdr_task *h) +{ + /* Since we restore the memory map the address remains the same and + * this is safe. 
This is the same as [compat_]sys_set_robust_list() */ + if (h->robust_futex_list) { + struct robust_list_head __user *rfl; + rfl = (void __user *)(unsigned long) h->robust_futex_list; + do_set_robust_list(rfl, h->robust_futex_head_len); + } +#ifdef CONFIG_COMPAT + if (h->compat_robust_futex_list) { + struct compat_robust_list_head __user *crfl; + crfl = compat_ptr(h->compat_robust_futex_list); + do_compat_set_robust_list(crfl, h->compat_robust_futex_head_len); + } +#endif +} +#else /* !CONFIG_FUTEX */ +static inline void save_task_robust_futex_list(struct ckpt_hdr_task *h, + struct task_struct *t) +{ +} + +static inline void restore_task_robust_futex_list(struct ckpt_hdr_task *h) +{ +} +#endif /* CONFIG_FUTEX */ + + /*********************************************************************** * Checkpoint */ @@ -46,6 +93,7 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) h->set_child_tid = (unsigned long) t->set_child_tid; h->clear_child_tid = (unsigned long) t->clear_child_tid; + save_task_robust_futex_list(h, t); } ret = ckpt_write_obj(ctx, &h->h); @@ -249,6 +297,7 @@ static int restore_task_struct(struct ckpt_ctx *ctx) (int __user *) (unsigned long) h->set_child_tid; t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid; + restore_task_robust_futex_list(h); } memset(t->comm, 0, TASK_COMM_LEN); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index f85b673..651255f 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -162,6 +162,11 @@ struct ckpt_hdr_task { __u32 exit_signal; __u32 pdeath_signal; + __u32 compat_robust_futex_head_len; + __u32 compat_robust_futex_list; /* a compat __user ptr */ + __u32 robust_futex_head_len; + __u64 robust_futex_list; /* a __user ptr */ + __u64 set_child_tid; __u64 clear_child_tid; } __attribute__((aligned(8))); diff --git a/include/linux/compat.h b/include/linux/compat.h index ef68119..50ef270 100644 --- a/include/linux/compat.h 
+++ b/include/linux/compat.h @@ -209,7 +209,8 @@ struct compat_robust_list_head { }; extern void compat_exit_robust_list(struct task_struct *curr); - +extern long do_compat_set_robust_list(struct compat_robust_list_head __user *head, + compat_size_t len); asmlinkage long compat_sys_set_robust_list(struct compat_robust_list_head __user *head, compat_size_t len); diff --git a/include/linux/futex.h b/include/linux/futex.h index ae755f6..c825790 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -185,6 +185,7 @@ union futex_key { #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = NULL } } #ifdef CONFIG_FUTEX +extern long do_set_robust_list(struct robust_list_head __user *head, size_t len); extern void exit_robust_list(struct task_struct *curr); extern void exit_pi_state_list(struct task_struct *curr); extern int futex_cmpxchg_enabled; diff --git a/kernel/futex.c b/kernel/futex.c index 23419c9..baaecb4 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2342,13 +2342,7 @@ out: * the list. There can only be one such pending lock. 
*/ -/** - * sys_set_robust_list() - Set the robust-futex list head of a task - * @head: pointer to the list-head - * @len: length of the list-head, as userspace expects - */ -SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head, - size_t, len) +long do_set_robust_list(struct robust_list_head __user *head, size_t len) { if (!futex_cmpxchg_enabled) return -ENOSYS; @@ -2364,6 +2358,17 @@ SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head, } /** + * sys_set_robust_list() - Set the robust-futex list head of a task + * @head: pointer to the list-head + * @len: length of the list-head, as userspace expects + */ +SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head, + size_t, len) +{ + return do_set_robust_list(head, len); +} + +/** * sys_get_robust_list() - Get the robust-futex list head of a task * @pid: pid of the process [zero for current task] * @head_ptr: pointer to a list-head pointer, the kernel fills it in diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c index 2357165..5e1a169 100644 --- a/kernel/futex_compat.c +++ b/kernel/futex_compat.c @@ -114,9 +114,9 @@ void compat_exit_robust_list(struct task_struct *curr) } } -asmlinkage long -compat_sys_set_robust_list(struct compat_robust_list_head __user *head, - compat_size_t len) +long +do_compat_set_robust_list(struct compat_robust_list_head __user *head, + compat_size_t len) { if (!futex_cmpxchg_enabled) return -ENOSYS; @@ -130,6 +130,13 @@ compat_sys_set_robust_list(struct compat_robust_list_head __user *head, } asmlinkage long +compat_sys_set_robust_list(struct compat_robust_list_head __user *head, + compat_size_t len) +{ + return do_compat_set_robust_list(head, len); +} + +asmlinkage long compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr, compat_size_t __user *len_ptr) { -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
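The save/restore scheme described in the patch above (record the user pointer together with the size of the list head, then replay both through the set-robust-list path so that the size check acts as the ABI check) can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel code: every demo_* type and function is hypothetical, the list-head layout is only modeled on the kernel's struct robust_list_head, and the length test in demo_set_robust_list() is assumed to match what do_set_robust_list() does. Unlike the patch, whose restore helper returns void, this sketch propagates the error so it can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins modeled on the robust-list head layout. */
struct demo_robust_list {
	struct demo_robust_list *next;
};

struct demo_robust_list_head {
	struct demo_robust_list list;
	long futex_offset;
	struct demo_robust_list *list_op_pending;
};

/* The two fields that the checkpoint header gains for the native ABI. */
struct demo_saved {
	uint64_t robust_futex_list;	/* user pointer, stored as integer */
	uint32_t robust_futex_head_len;	/* sizeof the list head at checkpoint */
};

struct demo_task {
	struct demo_robust_list_head *robust_list;
};

/* Checkpoint side: record the pointer and the size of the list head. */
void demo_save(struct demo_saved *h, const struct demo_task *t)
{
	h->robust_futex_list = (uint64_t)(uintptr_t)t->robust_list;
	h->robust_futex_head_len = (uint32_t)sizeof(*t->robust_list);
}

/* Assumed shape of the factored-out setter: reject unknown head sizes. */
int demo_set_robust_list(struct demo_task *t,
			 struct demo_robust_list_head *head, size_t len)
{
	if (len != sizeof(*head))
		return -EINVAL;
	t->robust_list = head;
	return 0;
}

/* Restart side: replay the saved pointer and length through the setter. */
int demo_restore(struct demo_task *t, const struct demo_saved *h)
{
	struct demo_robust_list_head *head;

	if (!h->robust_futex_list)
		return 0;	/* the task had no robust list */
	head = (struct demo_robust_list_head *)(uintptr_t)h->robust_futex_list;
	return demo_set_robust_list(t, head, h->robust_futex_head_len);
}
```

The point of routing restore through the same setter is that an image written by a kernel with a different list-head size is refused by the existing size check rather than by new, duplicated logic.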
* [C/R v20][PATCH 34/96] c/r: infrastructure for shared objects
2010-03-17 16:08 ` [C/R v20][PATCH 33/96] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 35/96] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
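The checkpoint-side behaviour described above (first encounter assigns an objref and dumps the state, later encounters emit only the identifier) can be sketched in userspace like this. It is a simplified illustration, not the code in objhash.c: the demo_* names, the 16-bucket table, the singly linked chains, and the hash function are all assumptions, and reference grabbing/dropping is omitted. Identifiers start at 1, in line with the changelog note that zero or negative objrefs are disallowed on restart.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_HASH_NBITS 4
#define DEMO_HASH_TOTAL (1u << DEMO_HASH_NBITS)

/* One tracked object: its address and the identifier assigned to it. */
struct demo_obj {
	int objref;
	void *ptr;
	struct demo_obj *next;	/* bucket chain */
};

struct demo_obj_hash {
	struct demo_obj *head[DEMO_HASH_TOTAL];
	int next_free_objref;
};

/* Assumed hash: multiplicative hash of the address, top bits as bucket. */
unsigned int demo_hash_ptr(const void *ptr)
{
	uintptr_t v = (uintptr_t)ptr;

	return (unsigned int)((v >> 4) * 2654435761u) >> (32 - DEMO_HASH_NBITS);
}

void demo_hash_init(struct demo_obj_hash *h)
{
	memset(h, 0, sizeof(*h));
	h->next_free_objref = 1;	/* 0 and negatives are never handed out */
}

/*
 * Look up @ptr; add it on first encounter.  Returns the objref, and sets
 * *first to 1 when the caller still has to dump the object's state, or
 * to 0 when only the identifier needs to be written.  Returns -1 on
 * allocation failure.
 */
int demo_obj_lookup_add(struct demo_obj_hash *h, void *ptr, int *first)
{
	unsigned int i = demo_hash_ptr(ptr);
	struct demo_obj *obj;

	for (obj = h->head[i]; obj; obj = obj->next) {
		if (obj->ptr == ptr) {
			*first = 0;
			return obj->objref;
		}
	}
	obj = malloc(sizeof(*obj));
	if (!obj)
		return -1;
	obj->ptr = ptr;
	obj->objref = h->next_free_objref++;
	obj->next = h->head[i];
	h->head[i] = obj;
	*first = 1;
	return obj->objref;
}

/* "One-way" hash: entries go away only when the whole table is cleared. */
void demo_hash_clear(struct demo_obj_hash *h)
{
	unsigned int i;

	for (i = 0; i < DEMO_HASH_TOTAL; i++) {
		while (h->head[i]) {
			struct demo_obj *obj = h->head[i];

			h->head[i] = obj->next;
			free(obj);
		}
	}
}
```

On restart the same table would be keyed by objref instead of by address; that direction is not shown here.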

Changelog[v20]:
  - Export key symbols to enable c/r from kernel modules
  - Avoid crash if incoming object doesn't have .restore
Changelog[v19-rc1]:
  - Define ckpt_obj_try_fetch
  - Disallow zero or negative objref during restart
  - [Matt Helsley] Add cpp definitions for enums
  - [Serge Hallyn] Use ckpt_err() in ckpt_obj_fetch()
  - [Serge Hallyn] Use ckpt_err() in ckpt_read_obj_type()
  - Factor out objref handling from {_,}ckpt_read_obj()
Changelog[v18]:
  - Add ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (useful for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read object header only (w/o payload)
Changelog[v17]:
  - Add ckpt_obj->flags with CKPT_OBJ_CHECKPOINTED flag
  - Add prototype of ckpt_obj_lookup
  - Complain on attempt to add NULL ptr to objhash
  - Prepare for 'leaks detection'
Changelog[v16]:
  - Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
  - Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
  - Replace long 'switch' statements with table lookups and callbacks.
  - Introduce checkpoint_obj() and restart_obj() helpers
  - Shared objects now dumped/saved right before they are referenced
  - Cleanup interface of shared objects
Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
    (Nathan Lynch <ntl@pobox.com>)
Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
  - Fix calculation of hash table size
Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E.
Hallyn <serue@us.ibm.com> --- checkpoint/Makefile | 1 + checkpoint/objhash.c | 462 ++++++++++++++++++++++++++++++++++++++ checkpoint/restart.c | 81 +++++-- checkpoint/sys.c | 7 + include/linux/checkpoint.h | 20 ++ include/linux/checkpoint_hdr.h | 17 ++ include/linux/checkpoint_types.h | 2 + 7 files changed, 572 insertions(+), 18 deletions(-) create mode 100644 checkpoint/objhash.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 99364cc..5aa6a75 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -4,6 +4,7 @@ obj-$(CONFIG_CHECKPOINT) += \ sys.o \ + objhash.o \ checkpoint.o \ restart.o \ process.o diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c new file mode 100644 index 0000000..ada5113 --- /dev/null +++ b/checkpoint/objhash.c @@ -0,0 +1,462 @@ +/* + * Checkpoint-restart - object hash infrastructure to manage shared objects + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DOBJ + +#include <linux/kernel.h> +#include <linux/hash.h> +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + +struct ckpt_obj; +struct ckpt_obj_ops; + +/* object operations */ +struct ckpt_obj_ops { + char *obj_name; + enum obj_type obj_type; + void (*ref_drop)(void *ptr, int lastref); + int (*ref_grab)(void *ptr); + int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr); + void *(*restore)(struct ckpt_ctx *ctx); +}; + +struct ckpt_obj { + int objref; + int flags; + void *ptr; + struct ckpt_obj_ops *ops; + struct hlist_node hash; +}; + +/* object internal flags */ +#define CKPT_OBJ_CHECKPOINTED 0x1 /* object already checkpointed */ + +struct ckpt_obj_hash { + struct hlist_head *head; + int next_free_objref; +}; + +/* helper grab/drop functions: */ + +static void obj_no_drop(void *ptr, int lastref) +{ + return; +} + +static int obj_no_grab(void *ptr) +{ + return 0; +} + +static struct ckpt_obj_ops ckpt_obj_ops[] = { + /* ignored object */ + { + .obj_name = "IGNORED", + .obj_type = CKPT_OBJ_IGNORE, + .ref_drop = obj_no_drop, + .ref_grab = obj_no_grab, + }, +}; + + +#define CKPT_OBJ_HASH_NBITS 10 +#define CKPT_OBJ_HASH_TOTAL (1UL << CKPT_OBJ_HASH_NBITS) + +static void obj_hash_clear(struct ckpt_obj_hash *obj_hash) +{ + struct hlist_head *h = obj_hash->head; + struct hlist_node *n, *t; + struct ckpt_obj *obj; + int i; + + for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) { + hlist_for_each_entry_safe(obj, n, t, &h[i], hash) { + obj->ops->ref_drop(obj->ptr, 1); + kfree(obj); + } + } +} + +void ckpt_obj_hash_free(struct ckpt_ctx *ctx) +{ + struct ckpt_obj_hash *obj_hash = ctx->obj_hash; + + if (obj_hash) { + obj_hash_clear(obj_hash); + kfree(obj_hash->head); + kfree(ctx->obj_hash); + ctx->obj_hash = NULL; + } +} + +int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx) +{ + struct ckpt_obj_hash *obj_hash; + struct hlist_head *head; + + obj_hash = kzalloc(sizeof(*obj_hash), GFP_KERNEL); + if (!obj_hash) + 
return -ENOMEM; + head = kzalloc(CKPT_OBJ_HASH_TOTAL * sizeof(*head), GFP_KERNEL); + if (!head) { + kfree(obj_hash); + return -ENOMEM; + } + + obj_hash->head = head; + obj_hash->next_free_objref = 1; + + ctx->obj_hash = obj_hash; + return 0; +} + +static struct ckpt_obj *obj_find_by_ptr(struct ckpt_ctx *ctx, void *ptr) +{ + struct hlist_head *h; + struct hlist_node *n; + struct ckpt_obj *obj; + + h = &ctx->obj_hash->head[hash_long((unsigned long) ptr, + CKPT_OBJ_HASH_NBITS)]; + hlist_for_each_entry(obj, n, h, hash) + if (obj->ptr == ptr) + return obj; + return NULL; +} + +static struct ckpt_obj *obj_find_by_objref(struct ckpt_ctx *ctx, int objref) +{ + struct hlist_head *h; + struct hlist_node *n; + struct ckpt_obj *obj; + + h = &ctx->obj_hash->head[hash_long((unsigned long) objref, + CKPT_OBJ_HASH_NBITS)]; + hlist_for_each_entry(obj, n, h, hash) + if (obj->objref == objref) + return obj; + return NULL; +} + +static inline int obj_alloc_objref(struct ckpt_ctx *ctx) +{ + return ctx->obj_hash->next_free_objref++; +} + +/** + * ckpt_obj_new - add an object to the obj_hash + * @ctx: checkpoint context + * @ptr: pointer to object + * @objref: object unique id + * @ops: object operations + * + * Add the object to the obj_hash. If @objref is zero, assign a unique + * object id and use @ptr as a hash key [checkpoint]. Else use @objref + * as a key [restart]. 
+ */ +static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr, + int objref, enum obj_type type) +{ + struct ckpt_obj_ops *ops = &ckpt_obj_ops[type]; + struct ckpt_obj *obj; + int i, ret; + + /* explicitly disallow null pointers */ + BUG_ON(!ptr); + /* make sure we don't change this accidentally */ + BUG_ON(ops->obj_type != type); + + obj = kzalloc(sizeof(*obj), GFP_KERNEL); + if (!obj) + return ERR_PTR(-ENOMEM); + + obj->ptr = ptr; + obj->ops = ops; + + if (!objref) { + /* use @obj->ptr to index, assign objref (checkpoint) */ + obj->objref = obj_alloc_objref(ctx); + i = hash_long((unsigned long) ptr, CKPT_OBJ_HASH_NBITS); + } else { + /* use @obj->objref to index (restart) */ + obj->objref = objref; + i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS); + } + + ret = ops->ref_grab(obj->ptr); + if (ret < 0) { + kfree(obj); + obj = ERR_PTR(ret); + } else { + hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]); + } + + return obj; +} + +/************************************************************************** + * Checkpoint + */ + +/** + * obj_lookup_add - lookup object and add if not in objhash + * @ctx: checkpoint context + * @ptr: pointer to object + * @type: object type + * @first: [output] first encounter (added to table) + * + * Look up the object pointed to by @ptr in the hash table. If it isn't + * already found there, add the object, and allocate a unique object + * id. Grab a reference to every object that is added, and maintain the + * reference until the entire hash is freed. 
+ */ +static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr, + enum obj_type type, int *first) +{ + struct ckpt_obj *obj; + + obj = obj_find_by_ptr(ctx, ptr); + if (!obj) { + obj = obj_new(ctx, ptr, 0, type); + *first = 1; + } else { + BUG_ON(obj->ops->obj_type != type); + *first = 0; + } + return obj; +} + +/** + * ckpt_obj_lookup - lookup object (by pointer) in objhash + * @ctx: checkpoint context + * @ptr: pointer to object + * @type: object type + * + * [used during checkpoint]. + * Return: objref (or zero if not found) + */ +int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) +{ + struct ckpt_obj *obj; + + obj = obj_find_by_ptr(ctx, ptr); + BUG_ON(obj && obj->ops->obj_type != type); + if (obj) + ckpt_debug("%s objref %d\n", obj->ops->obj_name, obj->objref); + return obj ? obj->objref : 0; +} +EXPORT_SYMBOL(ckpt_obj_lookup); + +/** + * ckpt_obj_lookup_add - lookup object and add if not in objhash + * @ctx: checkpoint context + * @ptr: pointer to object + * @type: object type + * @first: [output] first encoutner (added to table) + * + * [used during checkpoint]. + * Return: objref + */ +int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr, + enum obj_type type, int *first) +{ + struct ckpt_obj *obj; + + obj = obj_lookup_add(ctx, ptr, type, first); + if (IS_ERR(obj)) + return PTR_ERR(obj); + ckpt_debug("%s objref %d first %d\n", + obj->ops->obj_name, obj->objref, *first); + obj->flags |= CKPT_OBJ_CHECKPOINTED; + return obj->objref; +} +EXPORT_SYMBOL(ckpt_obj_lookup_add); + +/** + * ckpt_obj_reserve - reserve an objref + * @ctx: checkpoint context + * + * The reserved objref will not be used for subsequent objects. This + * gives an objref that can be safely used during restart without a + * matching object in checkpoint. [used during checkpoint]. 
+ */ +int ckpt_obj_reserve(struct ckpt_ctx *ctx) +{ + return obj_alloc_objref(ctx); +} +EXPORT_SYMBOL(ckpt_obj_reserve); + +/** + * checkpoint_obj - if not already in hash, add object and checkpoint + * @ctx: checkpoint context + * @ptr: pointer to object + * @type: object type + * + * Use obj_lookup_add() to lookup (and possibly add) the object to the + * hash table. If the CKPT_OBJ_CHECKPOINTED flag isn't set, then also + * save the object's state using its ops->checkpoint(). + * + * [This is used during checkpoint]. + * Returns: objref + */ +int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) +{ + struct ckpt_hdr_objref *h; + struct ckpt_obj *obj; + int new, ret = 0; + + obj = obj_lookup_add(ctx, ptr, type, &new); + if (IS_ERR(obj)) + return PTR_ERR(obj); + + if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) { + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF); + if (!h) + return -ENOMEM; + + h->objtype = type; + h->objref = obj->objref; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + if (ret < 0) + return ret; + + /* invoke callback to actually dump the state */ + if (obj->ops->checkpoint) + ret = obj->ops->checkpoint(ctx, ptr); + + obj->flags |= CKPT_OBJ_CHECKPOINTED; + } + return (ret < 0 ? ret : obj->objref); +} +EXPORT_SYMBOL(checkpoint_obj); + +/************************************************************************** + * Restart + */ + +/** + * restore_obj - read in and restore a (first seen) shared object + * @ctx: checkpoint context + * @h: ckpt_hdr of shared object + * + * Read in the header payload (struct ckpt_hdr_objref). Lookup the + * object to verify it isn't there. Then restore the object's state + * and add it to the objash. No need to explicitly grab a reference - + * we hold the initial instance of this object. (Object maintained + * until the entire hash is free). + * + * [This is used during restart]. 
+ */ +int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h) +{ + struct ckpt_obj_ops *ops; + struct ckpt_obj *obj; + void *ptr = ERR_PTR(-EINVAL); + + ckpt_debug("len %d ref %d type %d\n", h->h.len, h->objref, h->objtype); + if (h->objtype >= CKPT_OBJ_MAX) + return -EINVAL; + if (h->objref <= 0) + return -EINVAL; + + ops = &ckpt_obj_ops[h->objtype]; + BUG_ON(ops->obj_type != h->objtype); + + if (ops->restore) + ptr = ops->restore(ctx); + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + + if (obj_find_by_objref(ctx, h->objref)) + obj = ERR_PTR(-EINVAL); + else + obj = obj_new(ctx, ptr, h->objref, h->objtype); + /* + * Drop an extra reference to the object returned by ops->restore: + * On success, this clears the extra reference taken by obj_new(), + * and on failure, this cleans up the object itself. + */ + ops->ref_drop(ptr, 0); + if (IS_ERR(obj)) { + ops->ref_drop(ptr, 1); + return PTR_ERR(obj); + } + return obj->objref; +} + +/** + * ckpt_obj_insert - add an object with a given objref to obj_hash + * @ctx: checkpoint context + * @ptr: pointer to object + * @objref: unique object id + * @type: object type + * + * Add the object pointer to by @ptr and identified by unique object id + * @objref to the hash table (indexed by @objref). Grab a reference to + * every object added, and maintain it until the entire hash is freed. + * + * [This is used during restart]. 
+ */ +int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, + int objref, enum obj_type type) +{ + struct ckpt_obj *obj; + + if (objref <= 0) + return -EINVAL; + if (obj_find_by_objref(ctx, objref)) + return -EINVAL; + obj = obj_new(ctx, ptr, objref, type); + if (IS_ERR(obj)) + return PTR_ERR(obj); + ckpt_debug("%s objref %d\n", obj->ops->obj_name, objref); + return obj->objref; +} +EXPORT_SYMBOL(ckpt_obj_insert); + +/** + * ckpt_obj_try_fetch - fetch an object by its identifier + * @ctx: checkpoint context + * @objref: object id + * @type: object type + * + * Lookup the objref identifier by @objref in the hash table. Return + * an error not found. + * + * [This is used during restart]. + */ +void *ckpt_obj_try_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type) +{ + struct ckpt_obj *obj; + + obj = obj_find_by_objref(ctx, objref); + if (!obj) + return ERR_PTR(-EINVAL); + ckpt_debug("%s ref %d\n", obj->ops->obj_name, obj->objref); + if (obj->ops->obj_type == type) + return obj->ptr; + return ERR_PTR(-ENOMSG); +} +EXPORT_SYMBOL(ckpt_obj_try_fetch); + +void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type) +{ + void *ret = ckpt_obj_try_fetch(ctx, objref, type); + + if (unlikely(IS_ERR(ret))) + ckpt_err(ctx, PTR_ERR(ret), "%(O)Fetching object (type %d)\n", + objref, type); + return ret; +} +EXPORT_SYMBOL(ckpt_obj_fetch); diff --git a/checkpoint/restart.c b/checkpoint/restart.c index e2ed358..d33b18a 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -210,6 +210,63 @@ static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h) } /** + * _ckpt_read_objref - dispatch handling of a shared object + * @ctx: checkpoint context + * @hh: objrect descriptor + */ +static int _ckpt_read_objref(struct ckpt_ctx *ctx, struct ckpt_hdr *hh) +{ + struct ckpt_hdr *h; + int ret; + + h = ckpt_hdr_get(ctx, hh->len); + if (!h) + return -ENOMEM; + + *h = *hh; /* yay ! 
*/ + + _ckpt_debug(CKPT_DOBJ, "shared len %d type %d\n", h->len, h->type); + ret = ckpt_kread(ctx, (h + 1), hh->len - sizeof(struct ckpt_hdr)); + if (ret < 0) + goto out; + + ret = restore_obj(ctx, (struct ckpt_hdr_objref *) h); + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +/** + * ckpt_read_obj_dispatch - dispatch ERRORs and OBJREFs; don't return them + * @ctx: checkpoint context + * @h: desired ckpt_hdr + */ +static int ckpt_read_obj_dispatch(struct ckpt_ctx *ctx, struct ckpt_hdr *h) +{ + int ret; + + while (1) { + ret = ckpt_kread(ctx, h, sizeof(*h)); + if (ret < 0) + return ret; + _ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len); + if (h->len < sizeof(*h)) + return -EINVAL; + + if (h->type == CKPT_HDR_ERROR) { + ret = _ckpt_read_err(ctx, h); + if (ret < 0) + return ret; + } else if (h->type == CKPT_HDR_OBJREF) { + ret = _ckpt_read_objref(ctx, h); + if (ret < 0) + return ret; + } else + return 0; + } +} + +/** * _ckpt_read_obj - read an object (ckpt_hdr followed by payload) * @ctx: checkpoint context * @h: desired ckpt_hdr @@ -224,21 +281,11 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h, { int ret; - again: - ret = ckpt_kread(ctx, h, sizeof(*h)); + ret = ckpt_read_obj_dispatch(ctx, h); if (ret < 0) return ret; _ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n", h->type, h->len, len, max); - if (h->len < sizeof(*h)) - return -EINVAL; - - if (h->type == CKPT_HDR_ERROR) { - ret = _ckpt_read_err(ctx, h); - if (ret < 0) - return ret; - goto again; - } /* if len specified, enforce, else if maximum specified, enforce */ if ((len && h->len != len) || (!len && max && h->len > max)) @@ -330,13 +377,12 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max) struct ckpt_hdr *h; int ret; - ret = ckpt_kread(ctx, &hh, sizeof(hh)); + ret = ckpt_read_obj_dispatch(ctx, &hh); if (ret < 0) return ERR_PTR(ret); _ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n", hh.type, hh.len, len, max); - if (hh.len < sizeof(*h)) - return 
ERR_PTR(-EINVAL); + /* if len specified, enforce, else if maximum specified, enforce */ if ((len && hh.len != len) || (!len && max && hh.len > max)) return ERR_PTR(-EINVAL); @@ -372,15 +418,14 @@ void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type) h = ckpt_read_obj(ctx, len, len); if (IS_ERR(h)) { - ckpt_err(ctx, PTR_ERR(h), "Looking for type %d in ckptfile\n", - type); + ckpt_err(ctx, PTR_ERR(h), "Expecting to read type %d\n", type); return h; } if (h->type != type) { ckpt_hdr_put(ctx, h); - ckpt_err(ctx, -EINVAL, "Next object was type %d, not %d\n", - h->type, type); + ckpt_err(ctx, -EINVAL, "Expected type %d but got %d\n", + h->type, type); h = ERR_PTR(-EINVAL); } diff --git a/checkpoint/sys.c b/checkpoint/sys.c index 8b142ed..926c937 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -211,6 +211,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx) if (ctx->logfile) fput(ctx->logfile); + ckpt_obj_hash_free(ctx); + if (ctx->tasks_arr) task_arr_free(ctx); @@ -262,7 +264,12 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, ctx->logfile = fget(logfd); if (!ctx->logfile) goto err; + nolog: + err = -ENOMEM; + if (ckpt_obj_hash_alloc(ctx) < 0) + goto err; + atomic_inc(&ctx->refcount); return ctx; err: diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 61581f6..da6fd36 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -106,6 +106,25 @@ static inline int ckpt_get_error(struct ckpt_ctx *ctx) extern void restore_notify_error(struct ckpt_ctx *ctx); +/* obj_hash */ +extern void ckpt_obj_hash_free(struct ckpt_ctx *ctx); +extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx); + +extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h); +extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, + enum obj_type type); +extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, + enum obj_type type); +extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr, + enum 
obj_type type, int *first); +extern void *ckpt_obj_try_fetch(struct ckpt_ctx *ctx, int objref, + enum obj_type type); +extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, + enum obj_type type); +extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref, + enum obj_type type); +extern int ckpt_obj_reserve(struct ckpt_ctx *ctx); + extern struct ckpt_ctx *ckpt_ctx_get(struct ckpt_ctx *ctx); extern void ckpt_ctx_put(struct ckpt_ctx *ctx); @@ -139,6 +158,7 @@ static inline int ckpt_validate_errno(int errno) #define CKPT_DBASE 0x1 /* anything */ #define CKPT_DSYS 0x2 /* generic (system) */ #define CKPT_DRW 0x4 /* image read/write */ +#define CKPT_DOBJ 0x8 /* shared objects */ #define CKPT_DDEFAULT 0xffff /* default debug level */ diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 651255f..cdca9e4 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -64,6 +64,8 @@ enum { #define CKPT_HDR_BUFFER CKPT_HDR_BUFFER CKPT_HDR_STRING, #define CKPT_HDR_STRING CKPT_HDR_STRING + CKPT_HDR_OBJREF, +#define CKPT_HDR_OBJREF CKPT_HDR_OBJREF CKPT_HDR_TREE = 101, #define CKPT_HDR_TREE CKPT_HDR_TREE @@ -93,6 +95,21 @@ enum { #define CKPT_ARCH_X86_64 CKPT_ARCH_X86_64 }; +/* shared objrects (objref) */ +struct ckpt_hdr_objref { + struct ckpt_hdr h; + __u32 objtype; + __s32 objref; +} __attribute__((aligned(8))); + +/* shared objects types */ +enum obj_type { + CKPT_OBJ_IGNORE = 0, +#define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE + CKPT_OBJ_MAX +#define CKPT_OBJ_MAX CKPT_OBJ_MAX +}; + /* kernel constants */ struct ckpt_const { /* task */ diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h index afe76ad..90bbb16 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -39,6 +39,8 @@ struct ckpt_ctx { atomic_t refcount; + struct ckpt_obj_hash *obj_hash; /* repository for shared objects */ + struct task_struct *tsk;/* checkpoint: current target task */ char 
err_string[256]; /* checkpoint: error string */ -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
* [C/R v20][PATCH 35/96] c/r: detect resource leaks for whole-container checkpoint
2010-03-17 16:08 ` [C/R v20][PATCH 34/96] c/r: infrastructure for shared objects Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 36/96] deferqueue: generic queue to defer work Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if an object's actual reference count
does not match the count collected in the objhash. A higher actual count
indicates a leak (a reference to the object from a task not being
checkpointed).

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.

"Leak detection" occurs _before_ any real state is saved, as a
pre-step. This prevents races due to sharing with the outside world
where the sharing ceases before the leak test takes place, thus
protecting the checkpoint image from inconsistencies.

Once leak testing concludes, checkpoint will proceed. Because objects
are already in the objhash, checkpoint_obj() cannot distinguish
between the first and subsequent encounters. This is solved with a
per-object flag (CKPT_OBJ_CHECKPOINTED).

Two additional checks take place during checkpoint: one for objects
that were created, and one for objects that were destroyed, while the
leak-detection pre-step took place. (By the time these checks run, part
of the checkpoint image has already been written out to disk, so they
are purely advisory).
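The counting scheme described above can be sketched in user space as
follows. This is only an illustrative model, not the code in this patch:
struct model_obj and the model_*() helpers are hypothetical names, and
real_refs stands in for whatever ops->ref_users() would report.

```c
#include <stddef.h>

/* Hypothetical stand-in for an objhash entry (not struct ckpt_obj). */
struct model_obj {
	int real_refs;	/* what ops->ref_users() would report */
	int users;	/* references counted while collecting tasks */
};

/*
 * First encounter: the collecting task counts as one user, and the
 * hash itself grabs one more (real) reference, hence users = 2.
 */
static void model_collect_first(struct model_obj *obj)
{
	obj->users = 2;
	obj->real_refs++;
}

/* Subsequent encounter from another collected task. */
static void model_collect_again(struct model_obj *obj)
{
	obj->users++;
}

/* Return 1 if collected and real counts match for all objects, else 0. */
static int model_contained(const struct model_obj *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (objs[i].real_refs != objs[i].users)
			return 0;
	return 1;
}
```

An object referenced only by collected tasks ends up with equal counts,
while an object that is also referenced by an outside task keeps a
higher real count and fails the test.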
Changelog[v20]: - Export key symbols to enable c/r from kernel modules Changelog[v18]: - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic - Replace some EAGAIN with EBUSY - Add a few more ckpt_write_err()s - Introduce CKPT_OBJ_VISITED - ckpt_obj_collect() returns objref for new objects, 0 otherwise - Rename ckpt_obj_checkpointed() to ckpt_obj_visited() - Introduce ckpt_obj_visit() to mark objects as visited - Set the CHECKPOINTED flag on objects before calling checkpoint Changelog[v17]: - Leak detection is performed in two-steps - Detect reverse-leaks (objects disappearing unexpectedly) - Skip reverse-leak detection if ops->ref_users isn't defined Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- checkpoint/checkpoint.c | 41 ++++++++++ checkpoint/objhash.c | 188 +++++++++++++++++++++++++++++++++++++++++++- checkpoint/process.c | 5 + include/linux/checkpoint.h | 7 ++ 4 files changed, 237 insertions(+), 4 deletions(-) diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index ea1494d..c016a2d 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -314,6 +314,24 @@ static int checkpoint_pids(struct ckpt_ctx *ctx) return ret; } +static int collect_objects(struct ckpt_ctx *ctx) +{ + int n, ret = 0; + + for (n = 0; n < ctx->nr_tasks; n++) { + ckpt_debug("dumping task #%d\n", n); + ret = ckpt_collect_task(ctx, ctx->tasks_arr[n]); + if (ret < 0) { + ctx->tsk = ctx->tasks_arr[n]; + ckpt_err(ctx, ret, "%(T)Collect failed\n"); + ctx->tsk = NULL; + break; + } + } + + return ret; +} + struct ckpt_cnt_tasks { struct ckpt_ctx *ctx; int nr; @@ -536,6 +554,21 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) if (ret < 0) goto out; + if (!(ctx->uflags & CHECKPOINT_SUBTREE)) { + /* + * Verify that all objects are contained (no leaks): + * First collect them all into the objhash while counting users + * and then compare to the objects' real user counts. 
+ */ + ret = collect_objects(ctx); + if (ret < 0) + goto out; + if (!ckpt_obj_contained(ctx)) { + ret = -EBUSY; + goto out; + } + } + ret = checkpoint_write_header(ctx); if (ret < 0) goto out; @@ -548,6 +581,14 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) ret = checkpoint_all_tasks(ctx); if (ret < 0) goto out; + + /* verify that all objects were indeed visited */ + if (!ckpt_obj_visited(ctx)) { + ckpt_err(ctx, -EBUSY, "Leak: unvisited\n"); + ret = -EBUSY; + goto out; + } + ret = checkpoint_write_tail(ctx); if (ret < 0) goto out; diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index ada5113..22b1601 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -25,27 +25,32 @@ struct ckpt_obj_ops { enum obj_type obj_type; void (*ref_drop)(void *ptr, int lastref); int (*ref_grab)(void *ptr); + int (*ref_users)(void *ptr); int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr); void *(*restore)(struct ckpt_ctx *ctx); }; struct ckpt_obj { + int users; int objref; int flags; void *ptr; struct ckpt_obj_ops *ops; struct hlist_node hash; + struct hlist_node next; }; /* object internal flags */ #define CKPT_OBJ_CHECKPOINTED 0x1 /* object already checkpointed */ +#define CKPT_OBJ_VISITED 0x2 /* object already visited */ struct ckpt_obj_hash { struct hlist_head *head; + struct hlist_head list; int next_free_objref; }; -/* helper grab/drop functions: */ +/* helper grab/drop/users functions */ static void obj_no_drop(void *ptr, int lastref) { @@ -114,6 +119,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx) obj_hash->head = head; obj_hash->next_free_objref = 1; + INIT_HLIST_HEAD(&obj_hash->list); ctx->obj_hash = obj_hash; return 0; @@ -181,6 +187,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr, obj->ptr = ptr; obj->ops = ops; + obj->users = 2; /* extra reference that objhash itself takes */ if (!objref) { /* use @obj->ptr to index, assign objref (checkpoint) */ @@ -198,6 +205,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void 
*ptr, obj = ERR_PTR(ret); } else { hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]); + hlist_add_head(&obj->next, &ctx->obj_hash->list); } return obj; @@ -230,12 +238,35 @@ static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr, *first = 1; } else { BUG_ON(obj->ops->obj_type != type); + obj->users++; *first = 0; } return obj; } /** + * ckpt_obj_collect - collect object into objhash + * @ctx: checkpoint context + * @ptr: pointer to object + * @type: object type + * + * [used during checkpoint]. + * Return: objref if object is new, 0 otherwise, or an error + */ +int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) +{ + struct ckpt_obj *obj; + int first; + + obj = obj_lookup_add(ctx, ptr, type, &first); + if (IS_ERR(obj)) + return PTR_ERR(obj); + ckpt_debug("%s objref %d first %d\n", + obj->ops->obj_name, obj->objref, first); + return first ? obj->objref : 0; +} + +/** * ckpt_obj_lookup - lookup object (by pointer) in objhash * @ctx: checkpoint context * @ptr: pointer to object @@ -256,6 +287,21 @@ int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) } EXPORT_SYMBOL(ckpt_obj_lookup); +static inline int obj_reverse_leak(struct ckpt_ctx *ctx, struct ckpt_obj *obj) +{ + /* + * A "reverse" leak ? All objects should already be in the + * objhash by now. But an outside task may have created an + * object while we were collecting, which we didn't catch. 
+ */ + if (obj->ops->ref_users && !(ctx->uflags & CHECKPOINT_SUBTREE)) { + ckpt_err(ctx, -EBUSY, "%(O)%(P)Leak: reverse added late (%s)\n", + obj->objref, obj->ptr, obj->ops->obj_name); + return -EBUSY; + } + return 0; +} + /** * ckpt_obj_lookup_add - lookup object and add if not in objhash * @ctx: checkpoint context @@ -276,7 +322,11 @@ int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr, return PTR_ERR(obj); ckpt_debug("%s objref %d first %d\n", obj->ops->obj_name, obj->objref, *first); - obj->flags |= CKPT_OBJ_CHECKPOINTED; + + if (*first && obj_reverse_leak(ctx, obj)) + return -EBUSY; + + obj->flags |= CKPT_OBJ_VISITED; return obj->objref; } EXPORT_SYMBOL(ckpt_obj_lookup_add); @@ -318,6 +368,9 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) if (IS_ERR(obj)) return PTR_ERR(obj); + if (new && obj_reverse_leak(ctx, obj)) + return -EBUSY; + if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) { h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF); if (!h) @@ -332,15 +385,142 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) return ret; /* invoke callback to actually dump the state */ - if (obj->ops->checkpoint) - ret = obj->ops->checkpoint(ctx, ptr); + BUG_ON(!obj->ops->checkpoint); obj->flags |= CKPT_OBJ_CHECKPOINTED; + ret = obj->ops->checkpoint(ctx, ptr); } + + obj->flags |= CKPT_OBJ_VISITED; return (ret < 0 ? ret : obj->objref); } EXPORT_SYMBOL(checkpoint_obj); +/** + * ckpt_obj_visit - mark object as visited + * @ctx: checkpoint context + * @ptr: pointer to object + * @type: object type + * + * [used during checkpoint]. 
+ * Mark the object as visited, or fail if not found + */ +int ckpt_obj_visit(struct ckpt_ctx *ctx, void *ptr, enum obj_type type) +{ + struct ckpt_obj *obj; + + obj = obj_find_by_ptr(ctx, ptr); + BUG_ON(obj && obj->ops->obj_type != type); + + if (!obj) { + if (!(ctx->uflags & CHECKPOINT_SUBTREE)) { + /* if not found report reverse leak (full container) */ + ckpt_err(ctx, -EBUSY, + "%(O)%(P)Leak: reverse unknown (%s)\n", + 0, ptr, ckpt_obj_ops[type].obj_name); + return -EBUSY; + } + } else { + ckpt_debug("visit %s objref %d\n", + obj->ops->obj_name, obj->objref); + obj->flags |= CKPT_OBJ_VISITED; + } + return 0; +} +EXPORT_SYMBOL(ckpt_obj_visit); + +/* increment the 'users' count of an object */ +static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment) +{ + struct ckpt_obj *obj; + + obj = obj_find_by_ptr(ctx, ptr); + if (obj) + obj->users += increment; +} + +/* + * "Leak detection" - to guarantee a consistent checkpoint of a full + * container we verify that all resources are confined and isolated in + * that container: + * + * c/r code first walks through all tasks and collects all shared + * resources into the objhash, while counting the references to them; + * then, it compares this count to the object's real reference count, + * and if they don't match it means that an object has "leaked" to the + * outside. + * + * Otherwise, it is guaranteed that there are no references outside + * (of container). c/r code now proceeds to walk through all tasks, + * again, and checkpoints the resources. It ensures that all resources + * are already in the objhash, and that all of them are checkpointed. + * Otherwise it means that due to a race, an object was created or + * destroyed during the first walk but not accounted for. + * + * For instance, consider an outside task A that shared files_struct + * with inside task B. 
Then, after B's files were collected, A opens + * or closes a file, and immediately exits - before the first leak + * test is performed, such that the test passes. + */ + +/** + * ckpt_obj_contained - test if shared objects are contained in checkpoint + * @ctx: checkpoint context + * + * Loops through all objects in the table and compares the number of + * references accumulated during checkpoint, with the reference count + * reported by the kernel. + * + * Return 1 if respective counts match for all objects, 0 otherwise. + */ +int ckpt_obj_contained(struct ckpt_ctx *ctx) +{ + struct ckpt_obj *obj; + struct hlist_node *node; + + /* account for ctx->{file,logfile} (if in the table already) */ + ckpt_obj_users_inc(ctx, ctx->file, 1); + if (ctx->logfile) + ckpt_obj_users_inc(ctx, ctx->logfile, 1); + + hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) { + if (!obj->ops->ref_users) + continue; + if (obj->ops->ref_users(obj->ptr) != obj->users) { + ckpt_err(ctx, -EBUSY, + "%(O)%(P)%(S)Usage leak (%d != %d)\n", + obj->objref, obj->ptr, obj->ops->obj_name, + obj->ops->ref_users(obj->ptr), obj->users); + return 0; + } + } + + return 1; +} + +/** + * ckpt_obj_visited - test that all shared objects were visited + * @ctx: checkpoint context + * + * Return 1 if all objects were visited, 0 otherwise. 
+ */ +int ckpt_obj_visited(struct ckpt_ctx *ctx) +{ + struct ckpt_obj *obj; + struct hlist_node *node; + + hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) { + if (!(obj->flags & CKPT_OBJ_VISITED)) { + ckpt_err(ctx, -EBUSY, + "%(O)%(P)%(S)Leak: not visited\n", + obj->objref, obj->ptr, obj->ops->obj_name); + return 0; + } + } + + return 1; +} + /************************************************************************** * Restart */ diff --git a/checkpoint/process.c b/checkpoint/process.c index f36e320..ef394a5 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -245,6 +245,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } +int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t) +{ + return 0; +} + /*********************************************************************** * Restart */ diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index da6fd36..50ce8f9 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -113,6 +113,12 @@ extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx); extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h); extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type); +extern int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr, + enum obj_type type); +extern int ckpt_obj_contained(struct ckpt_ctx *ctx); +extern int ckpt_obj_visited(struct ckpt_ctx *ctx); +extern int ckpt_obj_visit(struct ckpt_ctx *ctx, void *ptr, + enum obj_type type); extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type); extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr, @@ -133,6 +139,7 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid); /* task */ extern int ckpt_activate_next(struct ckpt_ctx *ctx); +extern int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t); extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t); extern int restore_task(struct 
ckpt_ctx *ctx); -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
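For readers following the leak test in ckpt_obj_contained() above, here is
a minimal userspace sketch of the idea. It is not the kernel code: each time
a shared object is reached from a task, a per-object "users" counter is
bumped, and at the end that counter is compared with the object's real
reference count. All names (demo_obj, demo_table, demo_collect,
demo_contained) are hypothetical, the fixed-size array stands in for the
object hash, and the sketch ignores any reference that the hash or the
checkpoint context itself may hold.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX 8

/* Hypothetical stand-in for a shared kernel object (file, files_struct). */
struct demo_obj {
	int refcount;		/* what the "kernel" reports as the use count */
};

/* One table slot: the object and how many times the scan reached it. */
struct demo_entry {
	struct demo_obj *ptr;
	int users;
};

struct demo_table {
	struct demo_entry e[DEMO_MAX];
	size_t n;
};

/*
 * Record one reference to @ptr found while scanning the tasks.
 * Returns 1 the first time the object is seen, 0 if it was already in
 * the table, -1 if the table is full.
 */
int demo_collect(struct demo_table *t, struct demo_obj *ptr)
{
	size_t i;

	assert(t && ptr);
	for (i = 0; i < t->n; i++) {
		if (t->e[i].ptr == ptr) {
			t->e[i].users++;
			return 0;
		}
	}
	if (t->n == DEMO_MAX)
		return -1;
	t->e[t->n].ptr = ptr;
	t->e[t->n].users = 1;
	t->n++;
	return 1;
}

/* Return 1 if every collected count matches the reported count, else 0. */
int demo_contained(const struct demo_table *t)
{
	size_t i;

	for (i = 0; i < t->n; i++) {
		if (t->e[i].ptr->refcount != t->e[i].users)
			return 0;	/* someone outside the scan holds a ref */
	}
	return 1;
}
```

An object whose refcount is 2 and which is reached twice by the scan is
contained; one whose refcount is 2 but which is reached only once has a user
outside the checkpointed set, which is the "usage leak" case.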
* [C/R v20][PATCH 36/96] deferqueue: generic queue to defer work
2010-03-17 16:08 ` [C/R v20][PATCH 35/96] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
@ 2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
0 siblings, 1 reply; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when, during the scan
of tasks, an operation cannot be performed in place; deferring it
avoids the need for a second scan.

One use case is restoring an ipc shared memory region that has been
deleted (but is still attached): during restart it needs to be
created, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion cannot be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches.)

This interface allows chronic procrastination in the kernel:

deferqueue_create(void):
    Allocates and returns a new deferqueue.

deferqueue_run(deferqueue):
    Executes all the pending works in the queue. Returns the number
    of works executed, or an error upon the first error reported by
    a deferred work.

deferqueue_add(deferqueue, data, size, func, dtor):
    Enqueue a deferred work. @func is the callback function to do
    the work, which will be called with @data as an argument. @size
    tells the size of data. @dtor is a destructor callback that is
    invoked for deferred works remaining in the queue when the queue
    is destroyed. NOTE: for a given deferred work, @dtor is _not_
    called if @func was already called (regardless of the return
    value of the latter).
deferqueue_destroy(deferqueue):
    Free the deferqueue and any queued items while invoking the
    @dtor callback for each queued item.

Why aren't we using the existing kernel workqueue mechanism? We need
to defer the work until the end of the operation: not earlier, since
we need other things to be in place; not later, to not block waiting
for it. However, the workqueue schedules the work for 'some time
later'. Also, the kernel workqueue may run in any task context, but we
require many times that an operation be run in the context of some
specific restarting task (e.g., restoring IPC state of a certain
ipc_ns).

Instead, this mechanism is a simple way for the c/r operation as a
whole, and later a task in particular, to defer some action until
later (but not arbitrarily later) _in the restore_ operation.

Changelog[v19-rc1]
  - [Matt Helsley] Check for valid destructor before calling it
Changelog[v18]
  - Interface to pass simple pointers as data with deferqueue
Changelog[v17]
  - Fix deferqueue_add() function

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Kconfig         |    5 ++
 include/linux/deferqueue.h |   78 +++++++++++++++++++++++++++++++
 kernel/Makefile            |    1 +
 kernel/deferqueue.c        |  110 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 194 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/deferqueue.h
 create mode 100644 kernel/deferqueue.c

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index 21fc86b..4a2c845 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -2,10 +2,15 @@
 # implemented the hooks for processor state etc. needed by the
 # core checkpoint/restart code.
+config DEFERQUEUE
+	bool
+	default n
+
 config CHECKPOINT
 	bool "Checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
 	depends on CGROUP_FREEZER
+	select DEFERQUEUE
 	help
 	  Application checkpoint/restart is the ability to save the
 	  state of a running application so that it can later resume
diff --git a/include/linux/deferqueue.h b/include/linux/deferqueue.h
new file mode 100644
index 0000000..ea3b620
--- /dev/null
+++ b/include/linux/deferqueue.h
@@ -0,0 +1,78 @@
+/*
+ * deferqueue.h --- deferred work queue handling for Linux.
+ */
+
+#ifndef _LINUX_DEFERQUEUE_H
+#define _LINUX_DEFERQUEUE_H
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/*
+ * This interface allows chronic procrastination in the kernel:
+ *
+ * deferqueue_create(void):
+ *	Allocates and returns a new deferqueue.
+ *
+ * deferqueue_run(deferqueue):
+ *	Executes all the pending works in the queue. Returns the number
+ *	of works executed, or an error upon the first error reported by
+ *	a deferred work.
+ *
+ * deferqueue_add(deferqueue, data, size, func, dtor):
+ *	Enqueue a deferred work. @func is the callback function to
+ *	do the work, which will be called with @data as an argument.
+ *	@size tells the size of data. @dtor is a destructor callback
+ *	that is invoked for deferred works remaining in the queue when
+ *	the queue is destroyed. NOTE: for a given deferred work, @dtor
+ *	is _not_ called if @func was already called (regardless of the
+ *	return value of the latter).
+ *
+ * deferqueue_destroy(deferqueue):
+ *	Free the deferqueue and any queued items while invoking the
+ *	@dtor callback for each queued item.
+ *
+ * The following helpers are useful when @data is a simple pointer:
+ *
+ * deferqueue_add_ptr(deferqueue, ptr, func, dtor):
+ *	Enqueue a deferred work whose data is @ptr.
+ *
+ * deferqueue_data_ptr(data):
+ *	Convert a deferqueue @data to a void * pointer.
+ */
+
+
+typedef int (*deferqueue_func_t)(void *);
+
+struct deferqueue_entry {
+	deferqueue_func_t function;
+	deferqueue_func_t destructor;
+	struct list_head list;
+	char data[0];
+};
+
+struct deferqueue_head {
+	spinlock_t lock;
+	struct list_head list;
+};
+
+struct deferqueue_head *deferqueue_create(void);
+void deferqueue_destroy(struct deferqueue_head *head);
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+		   deferqueue_func_t func, deferqueue_func_t dtor);
+int deferqueue_run(struct deferqueue_head *head);
+
+static inline int deferqueue_add_ptr(struct deferqueue_head *head, void *ptr,
+				     deferqueue_func_t func,
+				     deferqueue_func_t dtor)
+{
+	return deferqueue_add(head, &ptr, sizeof(ptr), func, dtor);
+}
+
+static inline void *deferqueue_data_ptr(void *data)
+{
+	return *((void **) data);
+}
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..3c2c303 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -24,6 +24,7 @@ CFLAGS_REMOVE_sched_clock.o = -pg
 CFLAGS_REMOVE_perf_event.o = -pg
 endif
 
+obj-$(CONFIG_DEFERQUEUE) += deferqueue.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
diff --git a/kernel/deferqueue.c b/kernel/deferqueue.c
new file mode 100644
index 0000000..1204c8b
--- /dev/null
+++ b/kernel/deferqueue.c
@@ -0,0 +1,110 @@
+/*
+ * Infrastructure to manage deferred work
+ *
+ * This differs from a workqueue in that the work must be deferred
+ * until specifically run by the caller.
+ *
+ * As the only user currently is checkpoint/restart, which has
+ * very simple usage, the locking is kept simple. Adding rules
+ * is protected by the head->lock. But deferqueue_run() is only
+ * called once, after all entries have been added. So it is not
+ * protected. Similarly, _destroy is only called once when the
+ * ckpt_ctx is released, so it is not locked or refcounted. These
+ * can of course be added if needed by other users.
+ * + * Why not use workqueue ? We need to defer work until the end of an + * operation: not earlier, since we need other things to be in place; + * not later, to not block waiting for it. However, the workqueue + * schedules the work for 'some time later'. Also, workqueue may run + * in any task context, but we require many times that an operation + * be run in the context of some specific restarting task (e.g., + * restoring IPC state of a certain ipc_ns). + * + * Instead, this mechanism is a simple way for the c/r operation as a + * whole, and later a task in particular, to defer some action until + * later (but not arbitrarily later) _in the restore_ operation. + * + * Copyright (C) 2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + * + */ + +#include <linux/module.h> +#include <linux/kernel.h> +#include <linux/deferqueue.h> + +struct deferqueue_head *deferqueue_create(void) +{ + struct deferqueue_head *h = kmalloc(sizeof(*h), GFP_KERNEL); + if (h) { + spin_lock_init(&h->lock); + INIT_LIST_HEAD(&h->list); + } + return h; +} + +void deferqueue_destroy(struct deferqueue_head *h) +{ + if (!list_empty(&h->list)) { + struct deferqueue_entry *dq, *n; + + pr_debug("%s: freeing non-empty queue\n", __func__); + list_for_each_entry_safe(dq, n, &h->list, list) { + if (dq->destructor) + dq->destructor(dq->data); + list_del(&dq->list); + kfree(dq); + } + } + kfree(h); +} + +int deferqueue_add(struct deferqueue_head *head, void *data, int size, + deferqueue_func_t func, deferqueue_func_t dtor) +{ + struct deferqueue_entry *dq; + + dq = kmalloc(sizeof(*dq) + size, GFP_KERNEL); + if (!dq) + return -ENOMEM; + + dq->function = func; + dq->destructor = dtor; + memcpy(dq->data, data, size); + + pr_debug("%s: adding work %p func %p dtor %p\n", + __func__, dq, func, dtor); + spin_lock(&head->lock); + list_add_tail(&dq->list, 
&head->list); + spin_unlock(&head->lock); + return 0; +} + +/* + * deferqueue_run - perform all work in the work queue + * @head: deferqueue_head from which to run + * + * returns: number of works performed, or < 0 on error + */ +int deferqueue_run(struct deferqueue_head *head) +{ + struct deferqueue_entry *dq, *n; + int nr = 0; + int ret; + + list_for_each_entry_safe(dq, n, &head->list, list) { + pr_debug("doing work %p function %p\n", dq, dq->function); + /* don't call destructor - function callback should do it */ + ret = dq->function(dq->data); + if (ret < 0) + pr_debug("wq function failed %d\n", ret); + list_del(&dq->list); + kfree(dq); + nr++; + } + + return nr; +} -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
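To make the create/add/run/destroy life cycle of the patch above easier to
follow, here is a rough userspace model of it. This is only a sketch under
stated assumptions: the dq_* names are hypothetical, malloc()/free() replace
kmalloc()/kfree(), a hand-rolled singly linked list replaces list_head, and
there is no locking. It follows the posted deferqueue_run(), which runs every
entry and returns the count even when a callback reports failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*dq_func_t)(void *);

struct dq_entry {
	dq_func_t function;
	dq_func_t destructor;
	struct dq_entry *next;
	char data[];			/* private copy of the caller's data */
};

struct dq_head {
	struct dq_entry *first;
	struct dq_entry **tail;		/* link slot for the next entry */
};

struct dq_head *dq_create(void)
{
	struct dq_head *h = malloc(sizeof(*h));

	if (h) {
		h->first = NULL;
		h->tail = &h->first;
	}
	return h;
}

/* Queue @func to be called later with a copy of @data (@size bytes). */
int dq_add(struct dq_head *h, const void *data, size_t size,
	   dq_func_t func, dq_func_t dtor)
{
	struct dq_entry *e = malloc(sizeof(*e) + size);

	if (!e)
		return -1;
	e->function = func;
	e->destructor = dtor;
	e->next = NULL;
	memcpy(e->data, data, size);
	*h->tail = e;			/* append: works run in FIFO order */
	h->tail = &e->next;
	return 0;
}

/* Run and free every queued work; destructors are not called here. */
int dq_run(struct dq_head *h)
{
	struct dq_entry *e = h->first;
	int nr = 0;

	while (e) {
		struct dq_entry *next = e->next;

		e->function(e->data);
		free(e);
		nr++;
		e = next;
	}
	h->first = NULL;
	h->tail = &h->first;
	return nr;
}

/* Free the queue; works that never ran only get their destructor. */
void dq_destroy(struct dq_head *h)
{
	struct dq_entry *e = h->first;

	while (e) {
		struct dq_entry *next = e->next;

		if (e->destructor)
			e->destructor(e->data);
		free(e);
		e = next;
	}
	free(h);
}

static int dq_sum;			/* total of the values seen by works */
static int dq_dropped;			/* number of destructor calls */

static int add_to_sum(void *data) { dq_sum += *(int *)data; return 0; }
static int count_drop(void *data) { (void)data; dq_dropped++; return 0; }

/* Queue two works, run them, then destroy with one work still pending. */
int dq_demo(void)
{
	struct dq_head *h = dq_create();
	int v, ran;

	assert(h);
	v = 1; dq_add(h, &v, sizeof(v), add_to_sum, count_drop);
	v = 2; dq_add(h, &v, sizeof(v), add_to_sum, count_drop);
	ran = dq_run(h);
	v = 4; dq_add(h, &v, sizeof(v), add_to_sum, count_drop);
	dq_destroy(h);
	return ran;
}
```

In dq_demo() the two works queued before dq_run() execute and never see
their destructor, while the work queued afterwards is only handed to its
destructor by dq_destroy(). This mirrors the NOTE in the header comment that
@dtor is not called once @func has run.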
* [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() 2010-03-17 16:08 ` [C/R v20][PATCH 36/96] deferqueue: generic queue to defer work Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan [not found] ` <1268842164-5590-38-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2010-03-17 16:08 ` Oren Laadan 0 siblings, 2 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan While we assume all normal files and directories can be checkpointed, there are, as usual in the VFS, specialized places that will always need an ability to override these defaults. Although we could do this completely in the checkpoint code, that would bitrot quickly. This adds a new 'file_operations' function for checkpointing a file. It is assumed that there should be a dirt-simple way to make something (un)checkpointable that fits in with current code. As you can see in the ext[234] patches down the road, all that we have to do to make something simple be supported is add a single "generic" f_op entry. Also adds a new 'file_operations' function for 'collecting' a file for leak-detection during full-container checkpoint. This is useful for those files that hold references to other "collectable" objects. Two examples are pty files that point to corresponding tty objects, and eventpoll files that refer to the files they are monitoring. Finally, this patch introduces vfs_fcntl() so that it can be called from restart (see patch adding restart of files). Changelog[v17] - Introduce 'collect' method Changelog[v17] - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. 
Hallyn <serue@us.ibm.com> --- fs/fcntl.c | 21 +++++++++++++-------- include/linux/fs.h | 7 +++++++ 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index 97e01dc..e1f02ca 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, return err; } +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) +{ + int err; + + err = security_file_fcntl(filp, cmd, arg); + if (err) + goto out; + err = do_fcntl(fd, cmd, arg, filp); + out: + return err; +} + SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) { struct file *filp; @@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) if (!filp) goto out; - err = security_file_fcntl(filp, cmd, arg); - if (err) { - fput(filp); - return err; - } - - err = do_fcntl(fd, cmd, arg, filp); - + err = vfs_fcntl(fd, cmd, arg, filp); fput(filp); out: return err; diff --git a/include/linux/fs.h b/include/linux/fs.h index 6c08df2..65ebec5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -394,6 +394,7 @@ struct kstatfs; struct vm_area_struct; struct vfsmount; struct cred; +struct ckpt_ctx; extern void __init inode_init(void); extern void __init inode_init_early(void); @@ -1093,6 +1094,8 @@ struct file_lock { #include <linux/fcntl.h> +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp); + extern void send_sigio(struct fown_struct *fown, int fd, int band); #ifdef CONFIG_FILE_LOCKING @@ -1504,6 +1507,8 @@ struct file_operations { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **); + int (*checkpoint)(struct ckpt_ctx *, struct file *); + int (*collect)(struct ckpt_ctx *, struct file *); }; struct inode_operations 
{ @@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); +#define generic_file_checkpoint NULL + extern int vfs_readdir(struct file *, filldir_t, void *); extern int vfs_stat(char __user *, struct kstat *); -- 1.6.3.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 88+ messages in thread
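The opt-in nature of the new ->checkpoint method can be illustrated with a
small userspace sketch. It assumes, as the following patch in the series does
in checkpoint_file(), that a missing hook means the file type cannot be
checkpointed and is refused with -EBADF. The structures here are cut-down
stand-ins rather than the kernel definitions, and the demo_* names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types (hypothetical). */
struct ckpt_ctx {
	int nr_saved;		/* how many files were "written" so far */
};

struct file;

struct file_operations {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
	int (*collect)(struct ckpt_ctx *, struct file *);
};

struct file {
	const struct file_operations *f_op;
};

/* A "generic" hook: good enough for plain files in this model. */
static int demo_generic_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
	(void)file;
	ctx->nr_saved++;
	return 0;
}

/* A file type that opted in by filling the single f_op entry. */
static const struct file_operations demo_regular_fops = {
	.checkpoint = demo_generic_checkpoint,
};

/* A file type that did not opt in: both hooks stay NULL. */
static const struct file_operations demo_special_fops = {
	.checkpoint = NULL,
};

/* Dispatch through the hook, refusing files that lack one. */
int demo_checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
{
	assert(ctx && file);
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(ctx, file);
}
```

With this shape, supporting a simple file type costs one initializer line in
its file_operations, and anything that does not set the hook fails the
checkpoint instead of being saved incorrectly.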
[parent not found: <1268842164-5590-38-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
[not found] ` <1268842164-5590-38-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-17 16:08 ` Oren Laadan
0 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ingo Molnar

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_checkpoint_file() and generic_restore_file(),
which are good for normal files and directories. They do not yet
support unlinked files or directories.
Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename: ckpt_write_files() => checkpoint_fd_table()
  - Rename: ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching
    ckpt_hbuf_put() (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E.
Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- checkpoint/Makefile | 3 +- checkpoint/checkpoint.c | 11 + checkpoint/files.c | 444 ++++++++++++++++++++++++++++++++++++++ checkpoint/objhash.c | 52 +++++ checkpoint/process.c | 33 +++- checkpoint/sys.c | 8 + fs/locks.c | 35 +++ include/linux/checkpoint.h | 19 ++ include/linux/checkpoint_hdr.h | 59 +++++ include/linux/checkpoint_types.h | 5 + include/linux/fs.h | 10 + 11 files changed, 677 insertions(+), 2 deletions(-) create mode 100644 checkpoint/files.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 5aa6a75..1d0c058 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \ objhash.o \ checkpoint.o \ restart.o \ - process.o + process.o \ + files.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index c016a2d..2bc2495 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -18,6 +18,7 @@ #include <linux/time.h> #include <linux/fs.h> #include <linux/file.h> +#include <linux/fs_struct.h> #include <linux/dcache.h> #include <linux/mount.h> #include <linux/utsname.h> @@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) { struct task_struct *task; struct nsproxy *nsproxy; + struct fs_struct *fs; /* * No need for explicit cleanup here, because if an error @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) return -EINVAL; /* cleanup by ckpt_ctx_free() */ } + /* root vfs (FIX: WILL CHANGE with mnt-ns etc */ + task_lock(ctx->root_task); + fs = ctx->root_task->fs; + read_lock(&fs->lock); + ctx->root_fs_path = fs->root; + path_get(&ctx->root_fs_path); + read_unlock(&fs->lock); + task_unlock(ctx->root_task); + return 0; } diff --git a/checkpoint/files.c b/checkpoint/files.c new file mode 100644 index 0000000..7a57b24 --- /dev/null +++ b/checkpoint/files.c @@ -0,0 +1,444 @@ +/* + * Checkpoint 
file descriptors + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DFILE + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/deferqueue.h> +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + + +/************************************************************************** + * Checkpoint + */ + +/** + * ckpt_fill_fname - return pathname of a given file + * @path: path name + * @root: relative root + * @buf: buffer for pathname + * @len: buffer length (in) and pathname length (out) + */ +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len) +{ + struct path tmp = *root; + char *fname; + + BUG_ON(!buf); + spin_lock(&dcache_lock); + fname = __d_path(path, &tmp, buf, *len); + spin_unlock(&dcache_lock); + if (IS_ERR(fname)) + return fname; + *len = (buf + (*len) - fname); + /* + * FIX: if __d_path() changed these, it must have stepped out of + * init's namespace. Since currently we require a unified namespace + * within the container: simply fail. 
+ */ + if (tmp.mnt != root->mnt || tmp.dentry != root->dentry) + fname = ERR_PTR(-EBADF); + + return fname; +} + +/** + * checkpoint_fname - write a file name + * @ctx: checkpoint context + * @path: path name + * @root: relative root + */ +int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root) +{ + char *buf, *fname; + int ret, flen; + + /* + * FIXME: we can optimize and save memory (and storage) if we + * share strings (through objhash) and reference them instead + */ + + flen = PATH_MAX; + buf = kmalloc(flen, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + fname = ckpt_fill_fname(path, root, buf, &flen); + if (!IS_ERR(fname)) { + ret = ckpt_write_obj_type(ctx, fname, flen, + CKPT_HDR_FILE_NAME); + } else { + ret = PTR_ERR(fname); + ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n", + path->dentry->d_name.name); + } + + kfree(buf); + return ret; +} + +#define CKPT_DEFAULT_FDTABLE 256 /* an initial guess */ + +/** + * scan_fds - scan file table and construct array of open fds + * @files: files_struct pointer + * @fdtable: (output) array of open fds + * + * Returns the number of open fds found, and also the file table + * array via *fdtable. The caller should free the array. + * + * The caller must validate the file descriptors collected in the + * array before using them, e.g. by using fcheck_files(), in case + * the task's fdtable changes in the meantime. + */ +static int scan_fds(struct files_struct *files, int **fdtable) +{ + struct fdtable *fdt; + int *fds = NULL; + int i = 0, n = 0; + int tot = CKPT_DEFAULT_FDTABLE; + + /* + * We assume that all tasks possibly sharing the file table are + * frozen (or we are a single process and we checkpoint ourselves). + * Therefore, we can safely proceed after krealloc() from where we + * left off. Otherwise the file table may be modified by another + * task after we scan it. The behavior is this case is undefined, + * and either checkpoint or restart will likely fail. 
+ */ + retry: + fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL); + if (!fds) + return -ENOMEM; + + rcu_read_lock(); + fdt = files_fdtable(files); + for (/**/; i < fdt->max_fds; i++) { + if (!fcheck_files(files, i)) + continue; + if (n == tot) { + rcu_read_unlock(); + tot *= 2; /* won't overflow: kmalloc will fail */ + goto retry; + } + fds[n++] = i; + } + rcu_read_unlock(); + + *fdtable = fds; + return n; +} + +int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h) +{ + h->f_flags = file->f_flags; + h->f_mode = file->f_mode; + h->f_pos = file->f_pos; + h->f_version = file->f_version; + + ckpt_debug("file %s credref %d", file->f_dentry->d_name.name, + h->f_credref); + + /* FIX: need also file->uid, file->gid, file->f_owner, etc */ + + return 0; +} + +int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) +{ + struct ckpt_hdr_file_generic *h; + int ret; + + /* + * FIXME: when we'll add support for unlinked files/dirs, we'll + * need to distinguish between unlinked filed and unlinked dirs. 
+ */ + if (d_unlinked(file->f_dentry)) { + ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n", + file); + return -EBADF; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE); + if (!h) + return -ENOMEM; + + h->common.f_type = CKPT_FILE_GENERIC; + + ret = checkpoint_file_common(ctx, file, &h->common); + if (ret < 0) + goto out; + ret = ckpt_write_obj(ctx, &h->common.h); + if (ret < 0) + goto out; + ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path); + out: + ckpt_hdr_put(ctx, h); + return ret; +} +EXPORT_SYMBOL(generic_file_checkpoint); + +/* checkpoint callback for file pointer */ +int checkpoint_file(struct ckpt_ctx *ctx, void *ptr) +{ + struct file *file = (struct file *) ptr; + int ret; + + if (!file->f_op || !file->f_op->checkpoint) { + ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n", + file, file->f_op); + return -EBADF; + } + + ret = file->f_op->checkpoint(ctx, file); + if (ret < 0) + ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file); + return ret; +} + +/** + * ckpt_write_file_desc - dump the state of a given file descriptor + * @ctx: checkpoint context + * @files: files_struct pointer + * @fd: file descriptor + * + * Saves the state of the file descriptor; looks up the actual file + * pointer in the hash table, and if found saves the matching objref, + * otherwise calls ckpt_write_file to dump the file pointer too. 
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code. Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio. Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+ */ + objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE); + ckpt_debug("fd %d objref %d file %p coe %d)\n", fd, objref, file, coe); + if (objref < 0) { + ret = objref; + goto out; + } + + h->fd_objref = objref; + h->fd_descriptor = fd; + h->fd_close_on_exec = coe; + + ret = ckpt_write_obj(ctx, &h->h); +out: + ckpt_hdr_put(ctx, h); + if (file) + fput(file); + return ret; +} + +static int do_checkpoint_file_table(struct ckpt_ctx *ctx, + struct files_struct *files) +{ + struct ckpt_hdr_file_table *h; + int *fdtable = NULL; + int nfds, n, ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE); + if (!h) + return -ENOMEM; + + nfds = scan_fds(files, &fdtable); + if (nfds < 0) { + ret = nfds; + goto out; + } + + h->fdt_nfds = nfds; + + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret < 0) + goto out; + + ckpt_debug("nfds %d\n", nfds); + for (n = 0; n < nfds; n++) { + ret = checkpoint_file_desc(ctx, files, fdtable[n]); + if (ret < 0) + goto out; + } + + ret = deferqueue_run(ctx->files_deferq); + ckpt_debug("files_deferq ran %d entries\n", ret); + if (ret > 0) + ret = 0; + out: + kfree(fdtable); + return ret; +} + +/* checkpoint callback for file table */ +int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr) +{ + return do_checkpoint_file_table(ctx, (struct files_struct *) ptr); +} + +/* checkpoint wrapper for file table */ +int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct files_struct *files; + int objref; + + files = get_files_struct(t); + if (!files) + return -EBUSY; + objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE); + put_files_struct(files); + + return objref; +} + +/*********************************************************************** + * Collect + */ + +int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file) +{ + int ret; + + ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE); + if (ret <= 0) + return ret; + /* if first time for this file (ret > 0), invoke ->collect() */ + if 
(file->f_op->collect) + ret = file->f_op->collect(ctx, file); + if (ret < 0) + ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file); + return ret; +} + +static int collect_file_desc(struct ckpt_ctx *ctx, + struct files_struct *files, int fd) +{ + struct fdtable *fdt; + struct file *file; + int ret; + + rcu_read_lock(); + fdt = files_fdtable(files); + file = fcheck_files(files, fd); + if (file) + get_file(file); + rcu_read_unlock(); + + if (!file) { + ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file); + return -EBUSY; + } + + ret = ckpt_collect_file(ctx, file); + fput(file); + + return ret; +} + +static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files) +{ + int *fdtable; + int nfds, n; + int ret; + + /* if already exists (ret == 0), nothing to do */ + ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE); + if (ret <= 0) + return ret; + + /* if first time for this file table (ret > 0), proceed inside */ + nfds = scan_fds(files, &fdtable); + if (nfds < 0) + return nfds; + + for (n = 0; n < nfds; n++) { + ret = collect_file_desc(ctx, files, fdtable[n]); + if (ret < 0) + break; + } + + kfree(fdtable); + return ret; +} + +int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct files_struct *files; + int ret; + + files = get_files_struct(t); + if (!files) { + ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n"); + return -EBUSY; + } + ret = collect_file_table(ctx, files); + put_files_struct(files); + + return ret; +} diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index 22b1601..f25d130 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -13,6 +13,8 @@ #include <linux/kernel.h> #include <linux/hash.h> +#include <linux/file.h> +#include <linux/fdtable.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> @@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr) return 0; } +static int obj_file_table_grab(void *ptr) +{ + atomic_inc(&((struct files_struct *) ptr)->count); + return 0; +} + 
+static void obj_file_table_drop(void *ptr, int lastref) +{ + put_files_struct((struct files_struct *) ptr); +} + +static int obj_file_table_users(void *ptr) +{ + return atomic_read(&((struct files_struct *) ptr)->count); +} + +static int obj_file_grab(void *ptr) +{ + get_file((struct file *) ptr); + return 0; +} + +static void obj_file_drop(void *ptr, int lastref) +{ + fput((struct file *) ptr); +} + +static int obj_file_users(void *ptr) +{ + return atomic_long_read(&((struct file *) ptr)->f_count); +} + static struct ckpt_obj_ops ckpt_obj_ops[] = { /* ignored object */ { @@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_drop = obj_no_drop, .ref_grab = obj_no_grab, }, + /* files_struct object */ + { + .obj_name = "FILE_TABLE", + .obj_type = CKPT_OBJ_FILE_TABLE, + .ref_drop = obj_file_table_drop, + .ref_grab = obj_file_table_grab, + .ref_users = obj_file_table_users, + .checkpoint = checkpoint_file_table, + }, + /* file object */ + { + .obj_name = "FILE", + .obj_type = CKPT_OBJ_FILE, + .ref_drop = obj_file_drop, + .ref_grab = obj_file_grab, + .ref_users = obj_file_users, + .checkpoint = checkpoint_file, + }, }; diff --git a/checkpoint/process.c b/checkpoint/process.c index ef394a5..adc34a2 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN); } +static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_task_objs *h; + int files_objref; + int ret; + + files_objref = checkpoint_obj_file_table(ctx, t); + ckpt_debug("files: objref %d\n", files_objref); + if (files_objref < 0) { + ckpt_err(ctx, files_objref, "%(T)files_struct\n"); + return files_objref; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS); + if (!h) + return -ENOMEM; + h->files_objref = files_objref; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + return 
ret; +} + /* dump the task_struct of a given task */ int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t) { @@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) goto out; ret = checkpoint_cpu(ctx, t); ckpt_debug("cpu %d\n", ret); + if (ret < 0) + goto out; + ret = checkpoint_task_objs(ctx, t); + ckpt_debug("objs %d\n", ret); out: ctx->tsk = NULL; return ret; @@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t) { - return 0; + int ret; + + ret = ckpt_collect_file_table(ctx, t); + + return ret; } /*********************************************************************** diff --git a/checkpoint/sys.c b/checkpoint/sys.c index 926c937..30b8004 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx) if (ctx->kflags & CKPT_CTX_RESTART) restore_debug_free(ctx); + if (ctx->files_deferq) + deferqueue_destroy(ctx->files_deferq); + if (ctx->file) fput(ctx->file); if (ctx->logfile) fput(ctx->logfile); ckpt_obj_hash_free(ctx); + path_put(&ctx->root_fs_path); if (ctx->tasks_arr) task_arr_free(ctx); @@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, if (ckpt_obj_hash_alloc(ctx) < 0) goto err; + ctx->files_deferq = deferqueue_create(); + if (!ctx->files_deferq) + goto err; + atomic_inc(&ctx->refcount); return ctx; err: diff --git a/fs/locks.c b/fs/locks.c index a8794f2..721481a 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner) EXPORT_SYMBOL(locks_remove_posix); +int find_locks_with_owner(struct file *filp, fl_owner_t owner) +{ + struct inode *inode = filp->f_path.dentry->d_inode; + struct file_lock **inode_fl; + int ret = -EEXIST; + + lock_kernel(); + for_each_lock(inode, inode_fl) { + struct file_lock *fl = *inode_fl; + /* + * We could use 
posix_same_owner() along with a 'fake' + * file_lock. But, the fake file will never have the + * same fl_lmops as the fl that we are looking for and + * posix_same_owner() would just fall back to this + * check anyway. + */ + if (IS_POSIX(fl)) { + if (fl->fl_owner == owner) { + ret = 0; + break; + } + } else if (IS_FLOCK(fl) || IS_LEASE(fl)) { + if (fl->fl_file == filp) { + ret = 0; + break; + } + } else { + WARN(1, "unknown file lock type, fl_flags: %x", + fl->fl_flags); + } + } + unlock_kernel(); + return ret; +} + /* * This function is called on the last close of an open file. */ diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 50ce8f9..d74a890 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx, extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max); extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type); +extern char *ckpt_fill_fname(struct path *path, struct path *root, + char *buf, int *len); + /* ckpt kflags */ #define ckpt_set_ctx_kflag(__ctx, __kflag) \ set_bit(__kflag##_BIT, &(__ctx)->kflags) @@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t); extern int restore_restart_block(struct ckpt_ctx *ctx); +/* file table */ +extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t); +extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx, + struct task_struct *t); +extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr); + +/* files */ +extern int checkpoint_fname(struct ckpt_ctx *ctx, + struct path *path, struct path *root); +extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file); +extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr); + +extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h); + static inline int ckpt_validate_errno(int errno) { return (errno >= 0) && (errno < 
MAX_ERRNO); @@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno) #define CKPT_DSYS 0x2 /* generic (system) */ #define CKPT_DRW 0x4 /* image read/write */ #define CKPT_DOBJ 0x8 /* shared objects */ +#define CKPT_DFILE 0x10 /* files and filesystem */ #define CKPT_DDEFAULT 0xffff /* default debug level */ diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index cdca9e4..3222545 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -71,6 +71,8 @@ enum { #define CKPT_HDR_TREE CKPT_HDR_TREE CKPT_HDR_TASK, #define CKPT_HDR_TASK CKPT_HDR_TASK + CKPT_HDR_TASK_OBJS, +#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS CKPT_HDR_RESTART_BLOCK, #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK CKPT_HDR_THREAD, @@ -80,6 +82,15 @@ enum { /* 201-299: reserved for arch-dependent */ + CKPT_HDR_FILE_TABLE = 301, +#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE + CKPT_HDR_FILE_DESC, +#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC + CKPT_HDR_FILE_NAME, +#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME + CKPT_HDR_FILE, +#define CKPT_HDR_FILE CKPT_HDR_FILE + CKPT_HDR_TAIL = 9001, #define CKPT_HDR_TAIL CKPT_HDR_TAIL @@ -106,6 +117,10 @@ struct ckpt_hdr_objref { enum obj_type { CKPT_OBJ_IGNORE = 0, #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE + CKPT_OBJ_FILE_TABLE, +#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE + CKPT_OBJ_FILE, +#define CKPT_OBJ_FILE CKPT_OBJ_FILE CKPT_OBJ_MAX #define CKPT_OBJ_MAX CKPT_OBJ_MAX }; @@ -188,6 +203,12 @@ struct ckpt_hdr_task { __u64 clear_child_tid; } __attribute__((aligned(8))); +/* task's shared resources */ +struct ckpt_hdr_task_objs { + struct ckpt_hdr h; + __s32 files_objref; +} __attribute__((aligned(8))); + /* restart blocks */ struct ckpt_hdr_restart_block { struct ckpt_hdr h; @@ -220,4 +241,42 @@ enum restart_block_type { #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX }; +/* file system */ +struct ckpt_hdr_file_table { + struct ckpt_hdr h; + __s32 fdt_nfds; +} 
__attribute__((aligned(8))); + +/* file descriptors */ +struct ckpt_hdr_file_desc { + struct ckpt_hdr h; + __s32 fd_objref; + __s32 fd_descriptor; + __u32 fd_close_on_exec; +} __attribute__((aligned(8))); + +enum file_type { + CKPT_FILE_IGNORE = 0, +#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE + CKPT_FILE_GENERIC, +#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC + CKPT_FILE_MAX +#define CKPT_FILE_MAX CKPT_FILE_MAX +}; + +/* file objects */ +struct ckpt_hdr_file { + struct ckpt_hdr h; + __u32 f_type; + __u32 f_mode; + __u32 f_flags; + __u32 _padding; + __u64 f_pos; + __u64 f_version; +} __attribute__((aligned(8))); + +struct ckpt_hdr_file_generic { + struct ckpt_hdr_file common; +} __attribute__((aligned(8))); + #endif /* _CHECKPOINT_CKPT_HDR_H_ */ diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h index 90bbb16..aae6755 100644 --- a/include/linux/checkpoint_types.h +++ b/include/linux/checkpoint_types.h @@ -14,6 +14,8 @@ #include <linux/sched.h> #include <linux/nsproxy.h> +#include <linux/list.h> +#include <linux/path.h> #include <linux/fs.h> #include <linux/ktime.h> #include <linux/wait.h> @@ -40,6 +42,9 @@ struct ckpt_ctx { atomic_t refcount; struct ckpt_obj_hash *obj_hash; /* repository for shared objects */ + struct deferqueue_head *files_deferq; /* deferred file-table work */ + + struct path root_fs_path; /* container root (FIXME) */ struct task_struct *tsk;/* checkpoint: current target task */ char err_string[256]; /* checkpoint: error string */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 65ebec5..7902a51 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t); extern void locks_remove_flock(struct file *); extern void locks_release_private(struct file_lock *); extern void posix_test_lock(struct file *, struct file_lock *); +extern int find_locks_with_owner(struct file *filp, fl_owner_t owner); extern int posix_lock_file(struct file *, struct 
file_lock *, struct file_lock *); extern int posix_lock_file_wait(struct file *, struct file_lock *); extern int posix_unblock_lock(struct file *, struct file_lock *); @@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner) return; } +static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner) +{ + return -ENOENT; +} + static inline void locks_remove_flock(struct file *filp) { return; @@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes); loff_t inode_get_bytes(struct inode *inode); void inode_set_bytes(struct inode *inode, loff_t bytes); +#ifdef CONFIG_CHECKPOINT +extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file); +#else #define generic_file_checkpoint NULL +#endif extern int vfs_readdir(struct file *, filldir_t, void *); -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 88+ messages in thread
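An aside on the objref scheme this patch relies on: a 'struct file' that is shared by several descriptors is dumped once and referenced by number afterwards. The following is only a userspace sketch of that idea, not the kernel's objhash code. The names obj_table and obj_lookup_add(), the fixed-size linear table, and objrefs starting at 1 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the checkpoint object hash. */
#define OBJ_MAX 64

struct obj_table {
	const void *ptr[OBJ_MAX];	/* objects registered so far */
	int count;			/* number of valid entries */
};

/*
 * Return the objref for @ptr, registering it if it was not seen before.
 * *first is set to 1 when the object is new (the caller should dump its
 * state) and to 0 when it was already registered. Returns -1 when the
 * table is full. Objrefs starting at 1 is an assumption of this sketch.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *first)
{
	int i;

	assert(t != NULL && first != NULL);

	/* linear search: good enough for a sketch, a real table would hash */
	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}

	if (t->count == OBJ_MAX)
		return -1;

	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;
}
```

A caller would dump the object's state only when *first is set, and in either case record the returned objref next to the descriptor number, which is roughly what the fd_objref field in 'struct ckpt_hdr_file_desc' carries.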
* [C/R v20][PATCH 38/96] c/r: dump open file descriptors 2010-03-17 16:08 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan @ 2010-03-17 16:08 ` Oren Laadan 2010-03-17 16:08 ` Oren Laadan 1 sibling, 0 replies; 88+ messages in thread From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan Dump the file table with 'struct ckpt_hdr_file_table', followed by all open file descriptors. Because the 'struct file' corresponding to an fd can be shared, each one is assigned an objref and registered in the object hash. A reference to the 'file *' is kept for as long as it lives in the hash (the hash is only cleaned up at the end of the checkpoint). Also provide generic_file_checkpoint(), which is suitable for regular files and directories. It does not yet support unlinked files or directories. Changelog[v19]: - Fix false negative of test for unlinked files at checkpoint Changelog[v19-rc3]: - [Serge Hallyn] Rename fs_mnt to root_fs_path - [Dave Hansen] Error out on file locks and leases - [Serge Hallyn] Refuse checkpoint of file with f_owner Changelog[v19-rc1]: - [Matt Helsley] Add cpp definitions for enums Changelog[v18]: - Add a few more ckpt_write_err()s - [Dan Smith] Export fill_fname() as ckpt_fill_fname() - Introduce ckpt_collect_file() that also uses file->collect method - In collect_file_table() use retval from ckpt_obj_collect() to test for first-time-object Changelog[v17]: - Only collect sub-objects of files_struct once - Better file error debugging - Use (new) d_unlinked() Changelog[v16]: - Fix compile warning in checkpoint_bad() Changelog[v16]: - Reorder patch (move earlier in series) - Handle shared files_struct objects Changelog[v14]: - File objects are dumped/restored prior to the first reference - Introduce a per file-type restore() callback - Use struct file_operations->checkpoint() -
Put code for generic file descriptors in a separate function - Use one CKPT_FILE_GENERIC for both regular files and dirs - Revert change to pr_debug(), back to ckpt_debug() - Use only unsigned fields in checkpoint headers - Rename: ckpt_write_files() => checkpoint_fd_table() - Rename: ckpt_write_fd_data() => checkpoint_file() - Discard field 'h->parent' Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Changelog[v11]: - Discard handling of opened symlinks (there is no such thing) - ckpt_scan_fds() retries from scratch if hits size limits Changelog[v9]: - Fix a couple of leaks in ckpt_write_files() - Drop useless kfree from ckpt_scan_fds() Changelog[v8]: - initialize 'coe' to workaround gcc false warning Changelog[v6]: - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed) Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> --- checkpoint/Makefile | 3 +- checkpoint/checkpoint.c | 11 + checkpoint/files.c | 444 ++++++++++++++++++++++++++++++++++++++ checkpoint/objhash.c | 52 +++++ checkpoint/process.c | 33 +++- checkpoint/sys.c | 8 + fs/locks.c | 35 +++ include/linux/checkpoint.h | 19 ++ include/linux/checkpoint_hdr.h | 59 +++++ include/linux/checkpoint_types.h | 5 + include/linux/fs.h | 10 + 11 files changed, 677 insertions(+), 2 deletions(-) create mode 100644 checkpoint/files.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 5aa6a75..1d0c058 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \ objhash.o \ checkpoint.o \ restart.o \ - process.o + process.o \ + files.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index c016a2d..2bc2495 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -18,6 +18,7 @@ #include <linux/time.h> #include <linux/fs.h> #include <linux/file.h> +#include <linux/fs_struct.h> #include 
<linux/dcache.h> #include <linux/mount.h> #include <linux/utsname.h> @@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) { struct task_struct *task; struct nsproxy *nsproxy; + struct fs_struct *fs; /* * No need for explicit cleanup here, because if an error @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) return -EINVAL; /* cleanup by ckpt_ctx_free() */ } + /* root vfs (FIX: WILL CHANGE with mnt-ns etc */ + task_lock(ctx->root_task); + fs = ctx->root_task->fs; + read_lock(&fs->lock); + ctx->root_fs_path = fs->root; + path_get(&ctx->root_fs_path); + read_unlock(&fs->lock); + task_unlock(ctx->root_task); + return 0; } diff --git a/checkpoint/files.c b/checkpoint/files.c new file mode 100644 index 0000000..7a57b24 --- /dev/null +++ b/checkpoint/files.c @@ -0,0 +1,444 @@ +/* + * Checkpoint file descriptors + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DFILE + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/deferqueue.h> +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + + +/************************************************************************** + * Checkpoint + */ + +/** + * ckpt_fill_fname - return pathname of a given file + * @path: path name + * @root: relative root + * @buf: buffer for pathname + * @len: buffer length (in) and pathname length (out) + */ +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len) +{ + struct path tmp = *root; + char *fname; + + BUG_ON(!buf); + spin_lock(&dcache_lock); + fname = __d_path(path, &tmp, buf, *len); + spin_unlock(&dcache_lock); + if (IS_ERR(fname)) + return fname; + *len = (buf + (*len) - fname); + /* + * FIX: if __d_path() changed these, it must have stepped out of + * init's namespace. Since currently we require a unified namespace + * within the container: simply fail. 
+ */ + if (tmp.mnt != root->mnt || tmp.dentry != root->dentry) + fname = ERR_PTR(-EBADF); + + return fname; +} + +/** + * checkpoint_fname - write a file name + * @ctx: checkpoint context + * @path: path name + * @root: relative root + */ +int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root) +{ + char *buf, *fname; + int ret, flen; + + /* + * FIXME: we can optimize and save memory (and storage) if we + * share strings (through objhash) and reference them instead + */ + + flen = PATH_MAX; + buf = kmalloc(flen, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + fname = ckpt_fill_fname(path, root, buf, &flen); + if (!IS_ERR(fname)) { + ret = ckpt_write_obj_type(ctx, fname, flen, + CKPT_HDR_FILE_NAME); + } else { + ret = PTR_ERR(fname); + ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n", + path->dentry->d_name.name); + } + + kfree(buf); + return ret; +} + +#define CKPT_DEFAULT_FDTABLE 256 /* an initial guess */ + +/** + * scan_fds - scan file table and construct array of open fds + * @files: files_struct pointer + * @fdtable: (output) array of open fds + * + * Returns the number of open fds found, and also the file table + * array via *fdtable. The caller should free the array. + * + * The caller must validate the file descriptors collected in the + * array before using them, e.g. by using fcheck_files(), in case + * the task's fdtable changes in the meantime. + */ +static int scan_fds(struct files_struct *files, int **fdtable) +{ + struct fdtable *fdt; + int *fds = NULL; + int i = 0, n = 0; + int tot = CKPT_DEFAULT_FDTABLE; + + /* + * We assume that all tasks possibly sharing the file table are + * frozen (or we are a single process and we checkpoint ourselves). + * Therefore, we can safely proceed after krealloc() from where we + * left off. Otherwise the file table may be modified by another + * task after we scan it. The behavior in this case is undefined, + * and either checkpoint or restart will likely fail.
+ */ + retry: + fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL); + if (!fds) + return -ENOMEM; + + rcu_read_lock(); + fdt = files_fdtable(files); + for (/**/; i < fdt->max_fds; i++) { + if (!fcheck_files(files, i)) + continue; + if (n == tot) { + rcu_read_unlock(); + tot *= 2; /* won't overflow: kmalloc will fail */ + goto retry; + } + fds[n++] = i; + } + rcu_read_unlock(); + + *fdtable = fds; + return n; +} + +int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h) +{ + h->f_flags = file->f_flags; + h->f_mode = file->f_mode; + h->f_pos = file->f_pos; + h->f_version = file->f_version; + + ckpt_debug("file %s flags %#x\n", file->f_dentry->d_name.name, + h->f_flags); + + /* FIX: need also file->uid, file->gid, file->f_owner, etc */ + + return 0; +} + +int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) +{ + struct ckpt_hdr_file_generic *h; + int ret; + + /* + * FIXME: when we'll add support for unlinked files/dirs, we'll + * need to distinguish between unlinked files and unlinked dirs.
+ */ + if (d_unlinked(file->f_dentry)) { + ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n", + file); + return -EBADF; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE); + if (!h) + return -ENOMEM; + + h->common.f_type = CKPT_FILE_GENERIC; + + ret = checkpoint_file_common(ctx, file, &h->common); + if (ret < 0) + goto out; + ret = ckpt_write_obj(ctx, &h->common.h); + if (ret < 0) + goto out; + ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path); + out: + ckpt_hdr_put(ctx, h); + return ret; +} +EXPORT_SYMBOL(generic_file_checkpoint); + +/* checkpoint callback for file pointer */ +int checkpoint_file(struct ckpt_ctx *ctx, void *ptr) +{ + struct file *file = (struct file *) ptr; + int ret; + + if (!file->f_op || !file->f_op->checkpoint) { + ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n", + file, file->f_op); + return -EBADF; + } + + ret = file->f_op->checkpoint(ctx, file); + if (ret < 0) + ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file); + return ret; +} + +/** + * checkpoint_file_desc - dump the state of a given file descriptor + * @ctx: checkpoint context + * @files: files_struct pointer + * @fd: file descriptor + * + * Saves the state of the file descriptor; looks up the actual file + * pointer in the hash table, and if found saves the matching objref, + * otherwise calls checkpoint_file() to dump the file pointer too.
+ */ +static int checkpoint_file_desc(struct ckpt_ctx *ctx, + struct files_struct *files, int fd) +{ + struct ckpt_hdr_file_desc *h; + struct file *file = NULL; + struct fdtable *fdt; + int objref, ret; + int coe = 0; /* avoid gcc warning */ + pid_t pid; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC); + if (!h) + return -ENOMEM; + + rcu_read_lock(); + fdt = files_fdtable(files); + file = fcheck_files(files, fd); + if (file) { + coe = FD_ISSET(fd, fdt->close_on_exec); + get_file(file); + } + rcu_read_unlock(); + + /* sanity check first: find_locks_with_owner() dereferences 'file' */ + ret = -EBADF; + if (!file) { + ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd); + goto out; + } + + ret = find_locks_with_owner(file, files); + /* + * find_locks_with_owner() returns an error when there + * are no locks found, so we *want* it to return an error + * code. Its success means we have to fail the checkpoint. + */ + if (!ret) { + ret = -EBADF; + ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd); + goto out; + } + + /* + * TODO: Implement c/r of fowner and f_sigio. Should be + * trivial, but for now we just refuse its checkpoint + */ + pid = f_getown(file); + if (pid) { + ret = -EBUSY; + ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid); + goto out; + } + + /* + * if seen first time, this will add 'file' to the objhash, keep + * a reference to it, dump its state while at it.
+ */ + objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE); + ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe); + if (objref < 0) { + ret = objref; + goto out; + } + + h->fd_objref = objref; + h->fd_descriptor = fd; + h->fd_close_on_exec = coe; + + ret = ckpt_write_obj(ctx, &h->h); +out: + ckpt_hdr_put(ctx, h); + if (file) + fput(file); + return ret; +} + +static int do_checkpoint_file_table(struct ckpt_ctx *ctx, + struct files_struct *files) +{ + struct ckpt_hdr_file_table *h; + int *fdtable = NULL; + int nfds, n, ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE); + if (!h) + return -ENOMEM; + + nfds = scan_fds(files, &fdtable); + if (nfds < 0) { + ret = nfds; + ckpt_hdr_put(ctx, h); + goto out; + } + + h->fdt_nfds = nfds; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + if (ret < 0) + goto out; + + ckpt_debug("nfds %d\n", nfds); + for (n = 0; n < nfds; n++) { + ret = checkpoint_file_desc(ctx, files, fdtable[n]); + if (ret < 0) + goto out; + } + + ret = deferqueue_run(ctx->files_deferq); + ckpt_debug("files_deferq ran %d entries\n", ret); + if (ret > 0) + ret = 0; + out: + kfree(fdtable); + return ret; +} + +/* checkpoint callback for file table */ +int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr) +{ + return do_checkpoint_file_table(ctx, (struct files_struct *) ptr); +} + +/* checkpoint wrapper for file table */ +int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct files_struct *files; + int objref; + + files = get_files_struct(t); + if (!files) + return -EBUSY; + objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE); + put_files_struct(files); + + return objref; +} + +/*********************************************************************** + * Collect + */ + +int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file) +{ + int ret; + + ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE); + if (ret <= 0) + return ret; + /* if first time for this file (ret > 0), invoke ->collect() */ + if
(file->f_op->collect) + ret = file->f_op->collect(ctx, file); + if (ret < 0) + ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file); + return ret; +} + +static int collect_file_desc(struct ckpt_ctx *ctx, + struct files_struct *files, int fd) +{ + struct fdtable *fdt; + struct file *file; + int ret; + + rcu_read_lock(); + fdt = files_fdtable(files); + file = fcheck_files(files, fd); + if (file) + get_file(file); + rcu_read_unlock(); + + if (!file) { + ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file); + return -EBUSY; + } + + ret = ckpt_collect_file(ctx, file); + fput(file); + + return ret; +} + +static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files) +{ + int *fdtable; + int nfds, n; + int ret; + + /* if already exists (ret == 0), nothing to do */ + ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE); + if (ret <= 0) + return ret; + + /* if first time for this file table (ret > 0), proceed inside */ + nfds = scan_fds(files, &fdtable); + if (nfds < 0) + return nfds; + + for (n = 0; n < nfds; n++) { + ret = collect_file_desc(ctx, files, fdtable[n]); + if (ret < 0) + break; + } + + kfree(fdtable); + return ret; +} + +int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct files_struct *files; + int ret; + + files = get_files_struct(t); + if (!files) { + ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n"); + return -EBUSY; + } + ret = collect_file_table(ctx, files); + put_files_struct(files); + + return ret; +} diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index 22b1601..f25d130 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -13,6 +13,8 @@ #include <linux/kernel.h> #include <linux/hash.h> +#include <linux/file.h> +#include <linux/fdtable.h> #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> @@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr) return 0; } +static int obj_file_table_grab(void *ptr) +{ + atomic_inc(&((struct files_struct *) ptr)->count); + return 0; +} + 
+static void obj_file_table_drop(void *ptr, int lastref) +{ + put_files_struct((struct files_struct *) ptr); +} + +static int obj_file_table_users(void *ptr) +{ + return atomic_read(&((struct files_struct *) ptr)->count); +} + +static int obj_file_grab(void *ptr) +{ + get_file((struct file *) ptr); + return 0; +} + +static void obj_file_drop(void *ptr, int lastref) +{ + fput((struct file *) ptr); +} + +static int obj_file_users(void *ptr) +{ + return atomic_long_read(&((struct file *) ptr)->f_count); +} + static struct ckpt_obj_ops ckpt_obj_ops[] = { /* ignored object */ { @@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .ref_drop = obj_no_drop, .ref_grab = obj_no_grab, }, + /* files_struct object */ + { + .obj_name = "FILE_TABLE", + .obj_type = CKPT_OBJ_FILE_TABLE, + .ref_drop = obj_file_table_drop, + .ref_grab = obj_file_table_grab, + .ref_users = obj_file_table_users, + .checkpoint = checkpoint_file_table, + }, + /* file object */ + { + .obj_name = "FILE", + .obj_type = CKPT_OBJ_FILE, + .ref_drop = obj_file_drop, + .ref_grab = obj_file_grab, + .ref_users = obj_file_users, + .checkpoint = checkpoint_file, + }, }; diff --git a/checkpoint/process.c b/checkpoint/process.c index ef394a5..adc34a2 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN); } +static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_task_objs *h; + int files_objref; + int ret; + + files_objref = checkpoint_obj_file_table(ctx, t); + ckpt_debug("files: objref %d\n", files_objref); + if (files_objref < 0) { + ckpt_err(ctx, files_objref, "%(T)files_struct\n"); + return files_objref; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS); + if (!h) + return -ENOMEM; + h->files_objref = files_objref; + ret = ckpt_write_obj(ctx, &h->h); + ckpt_hdr_put(ctx, h); + + return 
ret; +} + /* dump the task_struct of a given task */ int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t) { @@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) goto out; ret = checkpoint_cpu(ctx, t); ckpt_debug("cpu %d\n", ret); + if (ret < 0) + goto out; + ret = checkpoint_task_objs(ctx, t); + ckpt_debug("objs %d\n", ret); out: ctx->tsk = NULL; return ret; @@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t) { - return 0; + int ret; + + ret = ckpt_collect_file_table(ctx, t); + + return ret; } /*********************************************************************** diff --git a/checkpoint/sys.c b/checkpoint/sys.c index 926c937..30b8004 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx) if (ctx->kflags & CKPT_CTX_RESTART) restore_debug_free(ctx); + if (ctx->files_deferq) + deferqueue_destroy(ctx->files_deferq); + if (ctx->file) fput(ctx->file); if (ctx->logfile) fput(ctx->logfile); ckpt_obj_hash_free(ctx); + path_put(&ctx->root_fs_path); if (ctx->tasks_arr) task_arr_free(ctx); @@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags, if (ckpt_obj_hash_alloc(ctx) < 0) goto err; + ctx->files_deferq = deferqueue_create(); + if (!ctx->files_deferq) + goto err; + atomic_inc(&ctx->refcount); return ctx; err: diff --git a/fs/locks.c b/fs/locks.c index a8794f2..721481a 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner) EXPORT_SYMBOL(locks_remove_posix); +int find_locks_with_owner(struct file *filp, fl_owner_t owner) +{ + struct inode *inode = filp->f_path.dentry->d_inode; + struct file_lock **inode_fl; + int ret = -EEXIST; + + lock_kernel(); + for_each_lock(inode, inode_fl) { + struct file_lock *fl = *inode_fl; + /* + * We could use 
posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;	/* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
--
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
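A note on the record layout in the checkpoint_hdr.h hunk above: struct ckpt_hdr_file carries an explicit _padding field and an aligned(8) attribute, which appears intended to keep the 64-bit fields at the same offsets on 32-bit and 64-bit builds. The following is a minimal userspace sketch of that idea, not code from the patch. It assumes that struct ckpt_hdr consists of two 32-bit fields (its definition is not shown in this patch), and it uses uint32_t/uint64_t as stand-ins for __u32/__u64. The two helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: stand-in for the real struct ckpt_hdr, whose definition is
 * not part of this patch. Two 32-bit fields are assumed for illustration.
 */
struct ckpt_hdr {
	uint32_t type;
	uint32_t len;
} __attribute__((aligned(8)));

/* Mirrors the field order of struct ckpt_hdr_file from the patch. */
struct ckpt_hdr_file {
	struct ckpt_hdr h;
	uint32_t f_type;
	uint32_t f_mode;
	uint32_t f_flags;
	uint32_t _padding;	/* keeps f_pos on an 8-byte boundary */
	uint64_t f_pos;
	uint64_t f_version;
} __attribute__((aligned(8)));

/* Compile-time check: the first 64-bit field is 8-byte aligned. */
_Static_assert(offsetof(struct ckpt_hdr_file, f_pos) % 8 == 0,
	       "f_pos must sit on an 8-byte boundary");

/* Hypothetical helper: byte offset of f_pos inside the record. */
static size_t ckpt_file_pos_offset(void)
{
	return offsetof(struct ckpt_hdr_file, f_pos);
}

/* Hypothetical helper: total size of the record. */
static size_t ckpt_file_record_size(void)
{
	return sizeof(struct ckpt_hdr_file);
}
```

Without the _padding field, a 32-bit ABI that aligns 64-bit integers to 4 bytes could place f_pos at a different offset than a 64-bit ABI would, and an image written on one could not be read on the other.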
* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
@ 2010-03-17 16:08 ` Oren Laadan
0 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar, containers, Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an fd
can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_file_checkpoint() and generic_restore_file(), which
are good for normal files and directories. They do not yet support
unlinked files or directories.

Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files()  =>  checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data()  =>  checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if it hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - Initialize 'coe' to work around gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching
    ckpt_hbuf_put() (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s",
+		   file->f_dentry->d_name.name);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			 file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -EEXIST;
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;	/* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
--
1.6.3.3
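For readers following the commit message of the patch above: the objref idea (a shared 'struct file' is dumped once, and every later descriptor that points at it records only a small integer reference) can be sketched in plain userspace C. This is an illustration under stated assumptions and not the patch's implementation. The names objref_table, OBJREF_MAX and objref_lookup_or_add() are hypothetical, and a linear scan over a fixed array is swapped in for the kernel's object hash to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

#define OBJREF_MAX 64	/* hypothetical fixed capacity for this sketch */

/*
 * Hypothetical stand-in for the object hash described in the patch:
 * maps an object pointer to a small integer reference (objref).
 */
struct objref_table {
	const void *ptr[OBJREF_MAX];
	int count;
};

/*
 * Return the objref for ptr, adding it if it has not been seen before.
 * *first is set to 1 only when the object is new, which is the point
 * at which its state would have to be written to the image.
 * Returns -1 when the table is full.
 */
static int objref_lookup_or_add(struct objref_table *t, const void *ptr,
				int *first)
{
	int i;

	assert(t != NULL && ptr != NULL && first != NULL);

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;	/* already dumped: reference only */
			return i + 1;	/* objrefs start at 1 in this sketch */
		}
	}
	if (t->count == OBJREF_MAX)
		return -1;

	t->ptr[t->count++] = ptr;
	*first = 1;			/* first sighting: dump its state */
	return t->count;
}
```

Two descriptors that share one open file (for example after dup()) would get the same objref from such a table, so the file's state is written once and the second descriptor record carries only the reference.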
[not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-19 0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
2010-03-19 0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-03-19 1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
-- strict thread matches above, loose matches on Subject: below --
2010-03-17 16:07 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 01/96] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 02/96] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 03/96] eclone (3/11): Define set_pidmap() function Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 04/96] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 05/96] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 06/96] eclone (6/11): Check invalid clone flags Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 07/96] eclone (7/11): Define do_fork_with_pids() Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 08/96] eclone (8/11): Implement sys_eclone for x86 (32,64) Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 09/96] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 10/96] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 11/96] eclone (11/11): Document sys_eclone Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 12/96] c/r: extend arch_setup_additional_pages() Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 13/96] c/r: break out new_user_ns() Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 14/96] c/r: split core function out of some set*{u,g}id functions Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 15/96] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 16/96] cgroup freezer: Update stale locking comments Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 17/96] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 18/96] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 19/96] Namespaces submenu Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 21/96] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 22/96] c/r: documentation Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 23/96] c/r: basic infrastructure for checkpoint/restart Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 24/96] c/r: x86_32 support " Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 25/96] c/r: x86-64: checkpoint/restart implementation Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 26/96] c/r: external checkpoint of a task other than ourself Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 27/96] c/r: export functionality used in next patch for restart-blocks Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 28/96] c/r: restart-blocks Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 29/96] c/r: checkpoint multiple processes Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 30/96] c/r: restart " Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 31/96] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 32/96] c/r: support for zombie processes Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 33/96] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 34/96] c/r: infrastructure for shared objects Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 35/96] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 36/96] deferqueue: generic queue to defer work Oren Laadan
2010-03-17 16:08 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
[not found] ` <1268842164-5590-38-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-17 16:08 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
2010-03-17 16:08 ` Oren Laadan
2010-03-17 16:08 ` Oren Laadan