linux-kernel.vger.kernel.org archive mirror
* [PATCH v21 00/100] Kernel based checkpoint/restart
@ 2010-05-01 14:14 Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
                   ` (101 more replies)
  0 siblings, 102 replies; 137+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Oren Laadan

Hi Andrew,

Here is the next version of the checkpoint/restart patchset.  This
version moves portions of checkpoint code closer to where they belong.

As a convenience we've collected a rough table of contents showing
places to start for some reviewers with limited time and/or scope
(see below).

Thanks to Jamie, Nick, Andreas, and all who helped review the last few
versions, and thanks in advance for comments on this version.

We'll be very grateful if this can get a spin in -mm to get some wider
testing in the meantime.

Thanks,

The Checkpoint/Restart developers.

---

Linux Checkpoint-Restart:
 web, wiki:	http://www.linux-cr.org
 bug track:	https://www.linux-cr.org/redmine

The repositories for the project are in:
 kernel:	http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
 user tools:	http://www.linux-cr.org/git/?p=user-cr.git;a=summary
 test suite:	http://www.linux-cr.org/git/?p=tests-cr.git;a=summary

---

TABLE OF CONTENTS

Patches                 Area/Role
-------------------------------------------------------------------------
11,20                   Documentation (eclone, c/r)
8-11,21,22,27,28        Syscall gluey bits

12                      Arch Maintainers
8,22-24                         x86-32/64
9,58,60                         s390
10,84-88                        powerpc

14,61-63,69,70,         Security
71,89-92,

33,34,35                Generic c/r
                        (shared "object" hash, leak detection, deferqueues)

25,27-31                Processes
5-7                       fork (eclone)
39-41,45,46               memory
13,18,51,52,54,           namespaces
81-83,94
53-57                     ipc
64-67                     signals
1-4,70,83                 pids, pgids, tids, tgids (eclone or pidns)
14,61,62,69               creds, capabilities, uids, gids
71                        sockets
76-78                     terminals (specifically pty)
27,28,32                  futexes (27,28 relate to futex syscalls restart)

39-41,45,46,55            mm (basically process memory)

15-17                   Cgroups

71-75,93-99             Networking

19,36-38,42-44,         Filesystems (also pseudo-filesystems, anon_inodes)
47-50,63,76-77,
79-82

Some patches show up in multiple places because they are functionally
related even though they cross Area/Role boundaries. While we've done our
best to make the table above comprehensive, it's entirely conceivable that
we've neglected a small piece of a largely unrelated patch. Please feel
free to point these out to Matt Helsley <matthltc@us.ibm.com> since he's
largely responsible for this table.

---

CHANGELOG:

[2010-Apr-30] v21
  - Add relevant maintainers/lists as Cc: in patch descriptions
  - Reorganize code: move checkpoint/* to kernel/checkpoint/*
  - Reorganize filesystem code into fs/*
  - Merge files dump/restore into a single patch
  - Merge mm dump/restore into a single patch
  - Move utsns c/r code from checkpoint/namespace.c to kernel/utsname*.c
  - [Matt Helsley] Move the signal c/r changes to kernel/signal.c
  - Move userns c/r code to kernel/{user,cred,user_namespace}.c
  - Assorted fixes to bisectability of patchset
  - Do not include checkpoint_hdr.h explicitly
  - Subsystems/modules register shared objects types for c/r
  - [Serge Hallyn] CONFIG_SECURITY_FILE_CAPABILITIES has been gone awhile
  - [Dan Smith] Unbreak compiling with CONFIG_CHECKPOINT=n or CONFIG_NET_NS=n
  - [Dan Smith] Clean up the error path in restore_veth()
  - [Dan Smith] Fix acquiring socket lock before reading RTNETLINK response
  - [Dan Smith] Skip down interfaces (v2)
  - [Dan Smith] Export net checkpoint fns
  - [Dan Smith] Add CHECKPOINT_NETNS flag
  - [Dan Smith] Netdev restore function dispatching from a table
  - [Dan Smith] Comment on controversial determination of "initial netns"
  - [Dan Smith] Simplify the E2BIG error handling in netdev c/r
  - [Dan Smith] Remove a redundant check for checkpoint support per-device
  - [Nathan Lynch] powerpc: fix build break with CONFIG_CHECKPOINT=n 
  - [Matt Helsley] Eventfd: add missing spin locks around eventfd checkpoint
  - [Matt Helsley] Put file_ops->checkpoint under CONFIG_CHECKPOINT
  - [Dan Smith] Fix build when CONFIG_INET=n
  - [Dan Smith] Disable softirqs when taking the socket queue lock
  - Replace __initcall() with late_initcall()
  - [Serge Hallyn] Remove [] following individual ops definitions.
  - [Serge Hallyn] Fix compilation for when CONFIG_USER_NS=y
  - [Serge Hallyn] handle CONFIG_{SYSVIPC,SYSVIPC,POSIX_MQUEUE}=n
  - [Serge Hallyn] Remove namespace.o from kernel/checkpoint/Makefile
  - [Stanislav O. Bezzubtsev] Fix omitted parameter name error
  - Put file_ops->checkpoint under CONFIG_CHECKPOINT
  - [Serge] Print out full path of file which crossed mnt_ns
  - Update Documentation/filesystem/vfs.txt
  - Restore_obj() to tolerate a preexisting object in the hash
  - Add ckpt_obj_del() to objhash for handling error conditions
  - [Serge Hallyn] Replace BUG_ON() in obj_new with error returns
  - [Matt Helsley] Move CKPT_CTX_ERROR* definitions to first use.
  - [Nathan Lynch] x86: use task_user_gs to checkpoint gs
  - Complain if checkpoint_hdr.h included without CONFIG_CHECKPOINT
  - Introduce kernel_write(), fix kernel_read()
  - Consolidate ckpt_read/write with kernel_read/write
  - [Christoffer Dall] Fix trivial bug in ckpt_msg macro
  - [Serge Hallyn] user/group: address dhowells feedback
 
[2010-Mar-16] v20
 BUG FIXES (only)
  - [Serge Hallyn] Fix unlabeled restore case
  - [Serge Hallyn] Always restore msg_msg label
  - [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
  - [Serge Hallyn] save_access_regs for self-checkpoint
  - [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
  - Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
  - Cleanup: no need to restore perm->{id,key,seq}
  - Fix sysvipc=n compile
  - Make uts_ns=n compile
  - Only use arch_setup_additional_pages() if supported by arch
  - Export key symbols to enable c/r from kernel modules
  - Avoid crash if incoming object doesn't have .restore
  - Replace error_sem with an event completion
  - [Serge Hallyn] Change sysctl and default for unprivileged use
  - [Nathan Lynch] Use syscall_get_error
  - Add entry for checkpoint/restart in MAINTAINERS 

[2010-Feb-19] v19
 NEW FEATURES
  - Support for x86-64 architecture
  - Support for c/r of LSM (smack, selinux)
  - Support for c/r of task fs_root and pwd
  - Support for c/r of epoll
  - Support for c/r of eventfd
  - Enable C/R while executing over NFS
  - Preliminary c/r of mounts namespace
  - Add @logfd argument to sys_{checkpoint,restart} prototypes
  - Define new api for error and debug logging
  - Restart to handle checkpoint images lacking {uts,ipc}-ns
  - Refuse to checkpoint if monitoring directories with dnotify
  - Refuse to checkpoint if file locks and leases are held
  - Refuse to checkpoint files with f_owner 
 OTHER CHANGES
  - Rebase to kernel 2.6.33-rc8
  - Settled version of new sys_eclone()
  - [Serge Hallyn] Fix potential use-before-set return (vdso)
  - Update documentation and examples for new syscalls API (doc)
  - [Liu Alexander] Fix typos (doc)
  - [Serge Hallyn] Update checkpoint image format (doc)
  - [Serge Hallyn] Use ckpt_err() for bad header values
  - sys_{checkpoint,restart} to use ptregs prototype
  - Set ctx->errno in do_ckpt_msg() if needed
  - Fix up headers so we can munge them for use by userspace
  - Multiple fixes to _ckpt_write_err() and friends
  - [Matt Helsley] Add cpp definitions for enums
  - [Serge Hallyn] Add global section container to image format
  - [Matt Helsley] Fix total byte read/write count for large images
  - ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
  - [Serge Hallyn] Use ckpt_err() for arch incompatibilities
  - Introduce walk_task_subtree() to iterate through descendants
  - Call restore_notify_error for restart (not checkpoint !)
  - Make kread/kwrite() abort if CKPT_CTX_ERROR is set
  - [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc
  - Simplify logic of tracking restarting tasks (->ctx)
  - Coordinator kills descendants on failure for proper cleanup
  - Prepare descendants needs PTRACE_MODE_ATTACH permissions
  - Threads wait for entire thread group before restoring
  - Add debug process-tree status during restart
  - Fix handling of bogus pid arg to sys_restart
  - In reparent_thread() test for PF_RESTARTING on parent
  - Keep __u32s in even groups for 32-64 bit compatibility
  - Define ckpt_obj_try_fetch
  - Disallow zero or negative objref during restart
  - Check for valid destructor before calling it (deferqueue)
  - Fix false negative of test for unlinked files at checkpoint
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - Restore thread/cpu state early
  - Ensure null-termination of file names read from image
  - Fix compile warning in restore_open_fname()
  - Introduce FOLL_DIRTY to follow_page() for "dirty" pages
  - [Serge Hallyn] Checkpoint saved_auxv as u64s
  - Export filemap_checkpoint()
  - [Serge Hallyn] Disallow checkpoint of tasks with aio requests
  - Fix compilation failure when !CONFIG_CHECKPOINT (regression)
  - Expose page write functions
  - Do not hold mmap_sem while checkpointing vma's
  - Do not hold mmap_sem when reading memory pages on restart
  - Move consider_private_page() to mm/memory.c:__get_dirty_page()
  - [Serge Hallyn] move destroy_mm into mmap.c and remove size check
  - [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
  - [Serge Hallyn] Fix return value of read_pages_contents()
  - [Serge Hallyn] Change m_type to long, not int (ipc)
  - Don't free sma if it's an error on restore
  - Use task->saved_sigmask and drop task->checkpoint_data
  - [Serge Hallyn] Handle saved_sigmask at checkpoint
  - Defer restore of blocked signals mask during restart
  - Self-restart to tolerate missing PGIDs
  - [Serge Hallyn] skb->tail can be offset
  - Export and leverage sock_alloc_file()
  - [Nathan Lynch] Fix net/checkpoint.c for 64-bit
  - [Dan Smith] Unify skb read/write functions and handle fragmented buffers
  - [Dan Smith] Update buffer restore code to match the new format
  - [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
  - [Dan Smith] Remove an unnecessary check on socket restart
  - [Dan Smith] Pass the stored sock->protocol into sock_create() on restore
  - Relax tcp.window_clamp value in INET restore
  - Restore gso_type fields on sockets and buffers for proper operation
  - Fix broken compilation for no-c/r architectures
  - Return -EBUSY (not BUG_ON) if fd is gone on restart
  - Fix the chunk size instead of auto-tune (epoll) 
 ARCH: x86 (32,64)
  - Use PTREGSCALL4 for sys_{checkpoint,restart}
  - Remove debug-reg support (need to redo with perf_events)
  - [Serge Hallyn] Support for ia32 (checkpoint, restart)
  - Split arch/x86/checkpoint.c to generic and 32bit specific parts
  - sys_{checkpoint,restart} to use ptregs
  - Allow X86_EFLAGS_RF on restart
  - [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
  - Move checkpoint.c from arch/x86/mm->arch/x86/kernel 
 ARCH: s390 [Serge Hallyn]
  - Define s390x sys_restart wrapper
  - Fixes to restart-blocks logic and signal path
  - Fix checkpoint and restart compat wrappers
  - sys_{checkpoint,restart} to use ptregs
  - Use simpler test_task_thread to test current ti flags
  - Fix 31-bit s390 checkpoint/restart wrappers
  - Update sys_checkpoint (do_sys_checkpoint on all archs)
  - [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel 
 ARCH: powerpc [Nathan Lynch]
  - [Serge Hallyn] Add hook task_has_saved_sigmask()
  - Warn if full register state unavailable
  - Fix up checkpoint syscall, tidy restart
  - [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel} 

[2009-Sep-22] v18
 NEW FEATURES
  - [Nathan Lynch] Re-introduce powerpc support
  - Save/restore pseudo-terminals
  - Save/restore (pty) controlling terminals
  - Save/restore PGIDs
  - [Dan Smith] Save/restore unix domain sockets
  - Save/restore FIFOs
  - Save/restore pending signals
  - Save/restore rlimits
  - Save/restore itimers
  - [Matt Helsley] Handle many non-pseudo file-systems
 OTHER CHANGES
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Nathan Lynch] discard const from struct cred * where appropriate
  - [Serge Hallyn][s390] Set return value for self-checkpoint 
  - Handle kmalloc failure in restore_sem_array()
  - [IPC] Collect files used by shm objects
  - [IPC] Use file (not inode) as shared object on checkpoint of shm
  - More ckpt_write_err()s to give information on checkpoint failure
  - Adjust format of pipe buffer to include the mandatory pre-header
  - [LEAKS] Mark the backing file as visited at checkpoint
  - Tighten checks on supported vma to checkpoint or restart
  - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
  - Introduce ckpt_collect_file() that also uses file->collect method
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - Fix leak-detection issue in collect_mm() (test for first-time obj)
  - Invoke set_close_on_exec() unconditionally on restart
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Interface to pass simple pointers as data with deferqueue
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace EAGAIN with EBUSY where necessary
  - Introduce CKPT_OBJ_VISITED in leak detection
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
  - Introduce ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read header only (w/o payload)
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Fix race of prepare_descendant() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile

[2009-Jul-21] v17
  - Introduce syscall clone_with_pids() to restore original pids
  - Support threads and zombies
  - Save/restore task->files
  - Save/restore task->sighand
  - Save/restore futex
  - Save/restore credentials
  - Introduce PF_RESTARTING to skip notifications on task exit
  - restart(2) allows caller to ask to freeze tasks after restart
  - restart(2) isn't idempotent: return -EINTR if interrupted
  - Improve debugging output handling 
  - Make multi-process restart logic more robust and complete
  - Correctly select return value for restarting tasks on success
  - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for frozen checkpointed tasks
  - Fix compilation without CONFIG_CHECKPOINT
  - Fix compilation with CONFIG_COMPAT
  - Fix headers includes and exports
  - Leak detection performed in two steps
  - Detect "inverse" leaks of objects (dis)appearing unexpectedly
  - Memory: save/restore mm->{flags,def_flags,saved_auxv}
  - Memory: only collect sub-objects of mm once (leak detection)
  - Files: validate f_mode after restore
  - Namespaces: leak detection for nsproxy sub-components
  - Namespaces: proper restart from namespace(s) without namespace(s)
  - Save global constants in header instead of per-object
  - IPC: replace sys_unshare() with create_ipc_ns()
  - IPC: restore objects in suitable namespace
  - IPC: correct behavior under !CONFIG_IPC_NS
  - UTS: save/restore all fields
  - UTS: replace sys_unshare() with create_uts_ns()
  - X86_32: sanitize cpu, debug, and segment registers on restart
  - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
  - cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lynch)
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture-dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.


* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
@ 2010-05-29 10:31 Albert Cahalan
  2010-06-01 19:32 ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 137+ messages in thread
From: Albert Cahalan @ 2010-05-29 10:31 UTC
  To: linux-kernel, sukadev, randy.dunlap, linuxppc-dev

Sukadev Bhattiprolu writes:

> Randy Dunlap [randy.dunlap at oracle.com] wrote:
>>> base of the region allocated for stack. These architectures
>>> must pass in the size of the stack-region in ->child_stack_size.
>>
>>                               stack region
>>
>> Seems unfortunate that different architectures use
>> the fields differently.
>
> Yes and no. The field still has a single purpose, just that
> some architectures may not need it. We enforce that if unused
> on an architecture, the field must be 0. It looked like
> the easiest way to keep the API common across architectures.

Yuck. You're forcing userspace to have #ifdef messes or,
more likely, just not work on all architectures. There is
no reason to have field usage vary by architecture. The
original clone syscall was not designed with ia64 and hppa
in mind, and has been causing trouble ever since. Let's not
perpetuate the problem.

Given code like this:   stack_base = malloc(stack_size);
stack_base and stack_size are what the kernel needs.

I suspect that you chose the defective method for some reason
related to restarting processes that were created with the
older system calls. I can't say most of us even care, but in
that broken-already case your process restarter can make up
some numbers that will work. (for i386, the base could be the
lowest address in the vma in which %esp lies, or even address 0)

A related issue is that stack allocation and deallocation can
be quite painful: it is difficult (some assembly required) to
free one's own stack, and impossible if one is already dead.
We could use a flag to let the kernel handle allocation, with
the stack getting freed just after any ptracer gets a last look.
This issue is especially troublesome for me because the syscall
essentially requires per-thread memory to work; it is currently
extremely difficult to use the syscall in code which lacks that.
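
[Editorial sketch, not part of the original message or the patchset.]
The base-plus-size argument above can be illustrated in C. If userspace
always hands over the lowest address and the length of the region (what
`stack_base = malloc(stack_size)` yields), the initial stack pointer can
be derived per architecture on the receiving side, so callers need no
#ifdefs. The names `struct stack_region`, `enum stack_dir` and
`initial_sp()` are hypothetical and exist only for this illustration;
they are not the eclone API, and real code would also have to apply the
architecture's stack alignment rules.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, uniform description of a child stack: lowest address
 * plus length, independent of the architecture. */
struct stack_region {
	uintptr_t base;	/* lowest address of the region */
	size_t size;	/* length of the region in bytes */
};

/* Direction in which the stack grows on a given architecture. */
enum stack_dir { STACK_GROWS_DOWN, STACK_GROWS_UP };

/*
 * Derive the initial stack pointer from (base, size).
 * Returns 0 on success, -EINVAL if the region is empty or would wrap
 * around the end of the address space.
 */
static int initial_sp(const struct stack_region *r, enum stack_dir dir,
		      uintptr_t *sp)
{
	if (r->size == 0 || r->size > UINTPTR_MAX - r->base)
		return -EINVAL;

	/* Downward-growing stacks start at the top of the region,
	 * upward-growing stacks start at its base. */
	*sp = (dir == STACK_GROWS_DOWN) ? r->base + r->size : r->base;
	return 0;
}
```

With a region of 0x2000 bytes at 0x1000, this sketch yields 0x3000 for
a downward-growing stack and 0x1000 for an upward-growing one; the
caller supplies the same two numbers in both cases.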


end of thread, other threads:[~2010-06-10  9:16 UTC | newest]

Thread overview: 137+ messages
2010-05-01 14:14 [PATCH v21 00/100] Kernel based checkpoint/restart Oren Laadan
2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
2010-05-01 22:10   ` David Miller
2010-05-02  0:14     ` Josh Boyer
2010-05-02  0:25     ` Matt Helsley
2010-05-03  8:48     ` Brian K. White
2010-05-03 21:02     ` Dave Hansen
2010-05-03 21:12       ` David Miller
2010-05-01 14:14 ` [PATCH v21 002/100] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
2010-05-01 14:14 ` [PATCH v21 003/100] eclone (3/11): Define set_pidmap() function Oren Laadan
2010-05-01 14:14 ` [PATCH v21 004/100] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 005/100] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 006/100] eclone (6/11): Check invalid clone flags Oren Laadan
2010-05-01 14:14 ` [PATCH v21 007/100] eclone (7/11): Define do_fork_with_pids() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 008/100] eclone (8/11): Implement sys_eclone for x86 (32,64) Oren Laadan
2010-05-01 14:14 ` [PATCH v21 009/100] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
2010-05-01 14:14 ` [PATCH v21 010/100] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
2010-05-05 21:14   ` Randy Dunlap
2010-05-05 22:25     ` Sukadev Bhattiprolu
2010-05-01 14:14 ` [PATCH v21 012/100] c/r: extend arch_setup_additional_pages() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 013/100] c/r: break out new_user_ns() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 014/100] c/r: split core function out of some set*{u,g}id functions Oren Laadan
2010-05-01 14:14 ` [PATCH v21 015/100] cgroup freezer: Update stale locking comments Oren Laadan
2010-05-06 19:40   ` Rafael J. Wysocki
2010-05-06 20:31     ` Matt Helsley
2010-05-06 22:34       ` Matt Helsley
2010-05-06 21:25     ` Oren Laadan
2010-05-10 21:01       ` Rafael J. Wysocki
2010-05-10 21:07         ` Matt Helsley
2010-05-10 21:12           ` Rafael J. Wysocki
2010-05-01 14:14 ` [PATCH v21 016/100] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
2010-05-01 14:14 ` [PATCH v21 017/100] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
2010-05-01 14:15 ` [PATCH v21 018/100] Namespaces submenu Oren Laadan
2010-05-01 14:15 ` [PATCH v21 019/100] Make file_pos_read/write() public and export kernel_write() Oren Laadan
2010-05-06 12:26   ` Josef Bacik
2010-05-01 14:15 ` [PATCH v21 020/100] c/r: documentation Oren Laadan
2010-05-06 20:27   ` Randy Dunlap
2010-05-07  6:54     ` Oren Laadan
2010-05-01 14:15 ` [PATCH v21 021/100] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
2010-05-01 14:15 ` [PATCH v21 022/100] c/r: basic infrastructure for checkpoint/restart Oren Laadan
2010-05-01 14:15 ` [PATCH v21 023/100] c/r: x86_32 support " Oren Laadan
2010-05-01 14:15 ` [PATCH v21 024/100] c/r: x86-64: checkpoint/restart implementation Oren Laadan
2010-05-01 14:15 ` [PATCH v21 025/100] c/r: external checkpoint of a task other than ourself Oren Laadan
2010-05-01 14:15 ` [PATCH v21 026/100] c/r: export functionality used in next patch for restart-blocks Oren Laadan
2010-05-01 14:15 ` [PATCH v21 027/100] c/r: restart-blocks Oren Laadan
2010-05-01 14:15 ` [PATCH v21 028/100] c/r: checkpoint multiple processes Oren Laadan
2010-05-01 14:15 ` [PATCH v21 029/100] c/r: restart " Oren Laadan
2010-05-01 14:15 ` [PATCH v21 030/100] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
2010-05-01 14:15 ` [PATCH v21 031/100] c/r: support for zombie processes Oren Laadan
2010-05-01 14:15 ` [PATCH v21 032/100] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
2010-05-03 16:10   ` Darren Hart
2010-05-03 18:02     ` Matt Helsley
2010-05-01 14:15 ` [PATCH v21 033/100] c/r: infrastructure for shared objects Oren Laadan
2010-05-01 14:15 ` [PATCH v21 034/100] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
2010-05-01 14:15 ` [PATCH v21 035/100] deferqueue: generic queue to defer work Oren Laadan
2010-05-01 14:15 ` [PATCH v21 036/100] c/r: introduce vfs_fcntl() Oren Laadan
2010-05-01 14:15 ` [PATCH v21 037/100] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
2010-05-01 14:15 ` [PATCH v21 038/100] c/r: checkpoint and restart open file descriptors Oren Laadan
2010-05-01 14:15 ` [PATCH v21 039/100] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
2010-05-01 14:15 ` [PATCH v21 040/100] Introduce FOLL_DIRTY to follow_page() for "dirty" pages Oren Laadan
2010-05-01 14:15 ` [PATCH v21 041/100] c/r: dump memory address space (private memory) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 042/100] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
2010-05-01 14:15 ` [PATCH v21 043/100] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
2010-05-01 14:15 ` [PATCH v21 044/100] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
2010-05-01 14:15 ` [PATCH v21 045/100] c/r: export shmem_getpage() to support shared memory Oren Laadan
2010-05-01 14:15 ` [PATCH v21 046/100] c/r: dump anonymous- and file-mapped- " Oren Laadan
2010-05-01 14:15 ` [PATCH v21 047/100] splice: export pipe/file-to-pipe/file functionality Oren Laadan
2010-05-01 14:15 ` [PATCH v21 048/100] c/r: support for open pipes Oren Laadan
2010-05-01 14:15 ` [PATCH v21 049/100] c/r: checkpoint and restore FIFOs Oren Laadan
2010-05-01 14:15 ` [PATCH v21 050/100] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
2010-05-01 14:15 ` [PATCH v21 051/100] c/r: make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
2010-05-01 14:15 ` [PATCH v21 052/100] c/r: support for UTS namespace Oren Laadan
2010-05-01 14:15 ` [PATCH v21 053/100] c/r (ipc): allow allocation of a desired ipc identifier Oren Laadan
2010-05-07 16:32   ` Manfred Spraul
2010-05-07 17:08     ` Oren Laadan
2010-05-01 14:15 ` [PATCH v21 054/100] c/r: save and restore sysvipc namespace basics Oren Laadan
2010-05-01 14:15 ` [PATCH v21 055/100] c/r: support share-memory sysv-ipc Oren Laadan
2010-05-01 14:15 ` [PATCH v21 056/100] c/r: support message-queues sysv-ipc Oren Laadan
2010-05-01 14:15 ` [PATCH v21 057/100] c/r: support semaphore sysv-ipc Oren Laadan
2010-05-01 14:15 ` [PATCH v21 058/100] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 059/100] c/r: add CKPT_COPY() macro Oren Laadan
2010-05-01 14:15 ` [PATCH v21 060/100] c/r: define s390-specific checkpoint-restart code Oren Laadan
2010-05-01 14:15 ` [PATCH v21 061/100] c/r: capabilities: define checkpoint and restore fns Oren Laadan
2010-05-01 14:15 ` [PATCH v21 062/100] c/r: checkpoint and restore task credentials Oren Laadan
2010-05-01 14:15 ` [PATCH v21 063/100] c/r: restore file->f_cred Oren Laadan
2010-05-01 14:15 ` [PATCH v21 064/100] c/r: checkpoint and restore (shared) task's sighand_struct Oren Laadan
2010-05-01 14:15 ` [PATCH v21 065/100] c/r: [signal 1/4] blocked and template for shared signals Oren Laadan
2010-05-01 14:15 ` [PATCH v21 066/100] c/r: [signal 2/4] checkpoint/restart of rlimit Oren Laadan
2010-05-01 14:15 ` [PATCH v21 067/100] c/r: [signal 3/4] pending signals (private, shared) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 068/100] c/r: [signal 4/4] support for real/virt/prof itimers Oren Laadan
2010-05-01 14:15 ` [PATCH v21 069/100] Expose may_setuid() in user.h and add may_setgid() (v2) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 070/100] c/r: correctly restore pgid Oren Laadan
2010-05-01 14:15 ` [PATCH v21 071/100] Add common socket helpers to unify the security hooks Oren Laadan
2010-05-01 14:15 ` [PATCH v21 072/100] c/r: introduce checkpoint/restore methods to struct proto_ops Oren Laadan
2010-05-01 14:15 ` [PATCH v21 073/100] c/r: Add AF_UNIX support (v12) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 074/100] c/r: add support for listening INET sockets (v2) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 075/100] c/r: add support for connected INET sockets (v5) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 076/100] c/r: [pty 1/2] allow allocation of desired pty slave Oren Laadan
2010-05-01 14:15 ` [PATCH v21 077/100] c/r: [pty 2/2] support for pseudo terminals Oren Laadan
2010-05-01 14:16 ` [PATCH v21 078/100] c/r: support for controlling terminal and job control Oren Laadan
2010-05-01 14:16 ` [PATCH v21 079/100] c/r: checkpoint/restart epoll sets Oren Laadan
2010-05-01 14:16 ` [PATCH v21 080/100] c/r: checkpoint/restart eventfd Oren Laadan
2010-05-01 14:16 ` [PATCH v21 081/100] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 082/100] c/r: preliminary support mounts namespace Oren Laadan
2010-05-01 14:16 ` [PATCH v21 083/100] c/r: nested pid namespaces (v3) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 084/100] powerpc: reserve checkpoint arch identifiers Oren Laadan
2010-05-01 14:16 ` [PATCH v21 085/100] powerpc: provide APIs for validating and updating DABR Oren Laadan
2010-05-01 14:16 ` [PATCH v21 086/100] powerpc: checkpoint/restart implementation Oren Laadan
2010-05-01 14:16 ` [PATCH v21 087/100] powerpc: wire up checkpoint and restart syscalls Oren Laadan
2010-05-01 14:16 ` [PATCH v21 088/100] powerpc: enable checkpoint support in Kconfig Oren Laadan
2010-05-01 14:16 ` [PATCH v21 089/100] c/r: add lsm name and lsm_info (policy header) to container info Oren Laadan
2010-05-01 14:16 ` [PATCH v21 090/100] c/r: add generic LSM c/r support (v7) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 091/100] c/r: add smack support to lsm c/r (v4) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 092/100] c/r: add selinux support (v6) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 093/100] c/r: Add checkpoint and collect hooks to net_device_ops Oren Laadan
2010-05-01 14:16 ` [PATCH v21 094/100] c/r: Basic support for network namespaces and devices (v6) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 095/100] c/r: Add rtnl_dellink() helper Oren Laadan
2010-05-01 14:16 ` [PATCH v21 096/100] c/r: Add checkpoint support for veth devices (v2) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 097/100] c/r: Add loopback checkpoint support (v2) Oren Laadan
2010-05-01 14:16 ` [PATCH v21 098/100] c/r: Add a checkpoint handler to the 'sit' device Oren Laadan
2010-05-01 14:16 ` [PATCH v21 099/100] c/r: Add checkpoint support to macvlan driver Oren Laadan
2010-05-01 14:16 ` [PATCH v21 100/100] c/r: add an entry for checkpoint/restart in MAINTAINERS Oren Laadan
2010-05-01 15:17 ` [PATCH v21 00/100] Kernel based checkpoint/restart Oren Laadan
2010-05-04 14:43 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page David Howells
2010-05-05 15:13   ` Oren Laadan
  -- strict thread matches above, loose matches on Subject: below --
2010-05-29 10:31 [PATCH v21 011/100] eclone (11/11): Document sys_eclone Albert Cahalan
2010-06-01 19:32 ` Sukadev Bhattiprolu
2010-06-01 19:59   ` Albert Cahalan
2010-06-02  1:38     ` Sukadev Bhattiprolu
2010-06-05 11:49       ` Albert Cahalan
2010-06-05 11:58       ` Albert Cahalan
2010-06-05 12:08       ` Albert Cahalan
2010-06-09 18:14         ` Sukadev Bhattiprolu
2010-06-09 18:46           ` H. Peter Anvin
2010-06-09 22:32           ` Roland McGrath
2010-06-10  9:15           ` Arnd Bergmann
