* c/r: support for x86-64 arch
@ 2009-12-06 20:31 Oren Laadan
From: Oren Laadan @ 2009-12-06 20:31 UTC
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Alexey Dobriyan, Louis Rilling, Dave Hansen
The following patches add experimental support for the x86-64 arch. The
code is based on Alexey's submission from a while ago.

The basic case of 64-bit process checkpoint/restart works. Other cases,
such as checkpoint/restart of 32-bit processes (64->64, 32->64 and also
64->32), are not tested. Neither is self-checkpoint.

Being far from an expert on x86-64, I collected bits and pieces from
other places in the kernel, so this needs a serious review, including:
- How load_cpu_regs() restores the task's current state; I tried to
  follow similar work done by the context switch code.
- For self-checkpoint, making sure we get the correct "running" state
  from the current registers (e.g. segments), not from pt_regs.
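As background for the second review item: an x86 segment selector packs a descriptor index (bits 3-15), a table indicator (bit 2, set for the LDT) and a requested privilege level (bits 0-1). The following is a minimal userspace sketch, not code from these patches, of how TLS and LDT selectors can be mapped to a layout-independent value and back. The `SKETCH_*` names and the TLS slot range 12..14 are assumptions made for illustration, and the fixed user code/data selectors are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_SEG_NULL	0x0000
#define SKETCH_SEG_TLS	0x4000	/* tag: TLS slot in the GDT */
#define SKETCH_SEG_LDT	0x8000	/* tag: entry in the LDT */
#define SKETCH_TLS_MIN	12	/* hypothetical first TLS GDT index */
#define SKETCH_TLS_MAX	14	/* hypothetical last TLS GDT index */

/* Encode a raw user selector; returns -1 if this sketch does not cover it. */
static int sketch_encode(uint16_t sel)
{
	uint16_t index = sel >> 3;

	if (sel == 0)
		return SKETCH_SEG_NULL;
	assert((sel & 3) == 3);		/* user selectors carry RPL 3 */
	if (sel & 4)			/* TI bit set: LDT entry */
		return SKETCH_SEG_LDT | index;
	if (index >= SKETCH_TLS_MIN && index <= SKETCH_TLS_MAX)
		return SKETCH_SEG_TLS | (index - SKETCH_TLS_MIN);
	return -1;
}

/* Decode back to a raw selector; unknown encodings decode to 0 here. */
static uint16_t sketch_decode(uint16_t enc)
{
	if (enc & SKETCH_SEG_TLS)	/* GDT entry, RPL 3 */
		return (uint16_t)(((SKETCH_TLS_MIN + (enc & ~SKETCH_SEG_TLS)) << 3) | 3);
	if (enc & SKETCH_SEG_LDT)	/* LDT entry (TI set), RPL 3 */
		return (uint16_t)(((enc & ~SKETCH_SEG_LDT) << 3) | 7);
	return 0;
}
```

The point of encoding relative to the first TLS slot is that the image no longer depends on where a given kernel places its TLS entries in the GDT.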

The first patch relocates and splits the current x86-32 code. The second
patch adds support for x86-64. The third patch provides the user-cr
eclone() wrapper, based on Dave and Louis's work.

Oren.
* [PATCH 1/2] c/r: [x86_32] sys_restore to use ptregs prototype
@ 2009-12-06 20:31 Oren Laadan
From: Oren Laadan @ 2009-12-06 20:31 UTC
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Alexey Dobriyan, Louis Rilling, Dave Hansen

Similar to other select syscalls (fork, clone, execve), sys_restart
needs to access the pt_regs structure, so that it can modify it to
restore the original state from the time of the checkpoint.

(This is less of an issue for x86-32, but it is required for those
architectures that otherwise save/restore only partial state (e.g. not
all registers) during syscall entry/exit, like x86-64.)

This patch prepares to support c/r on x86-64, specifically:

* Change the syscall prototype and definition to accept the pt_regs
  struct as an argument (passed in the %eax register).

* Move arch/x86/mm/checkpoint*.c to arch/x86/kernel/...

* Split the 32bit-dependent part of arch/x86/kernel/checkpoint.c into a
  new arch/x86/kernel/checkpoint_32.c

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/include/asm/syscalls.h      |    5 +
 arch/x86/kernel/Makefile             |    8 +
 arch/x86/{mm => kernel}/checkpoint.c |  293 +++++++++------------------------
 arch/x86/kernel/checkpoint_32.c      |  191 ++++++++++++++++++++++
 arch/x86/kernel/entry_32.S           |    3 +
 arch/x86/kernel/syscall_table_32.S   |    2 +-
 arch/x86/mm/Makefile                 |    2 -
 checkpoint/sys.c                     |    5 +-
 include/linux/checkpoint.h           |    2 +
 include/linux/syscalls.h             |    2 -
 10 files changed, 288 insertions(+), 225 deletions(-)
 rename arch/x86/{mm => kernel}/checkpoint.c (77%)
 create mode 100644 arch/x86/kernel/checkpoint_32.c

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 2cadb8e..1079447 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -43,6 +43,11 @@ int sys_clone(struct pt_regs *);
 int sys_eclone(struct pt_regs *);
 int sys_execve(struct pt_regs *);

+/* kernel/checkpoint_32.c */
+#ifdef CONFIG_CHECKPOINT
+long sys_restart(struct pt_regs *);
+#endif
+
 /* kernel/signal.c */
 asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
 asmlinkage int sys_sigaction(int, const struct old_sigaction __user *,
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index d8e5d0c..2821fd6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -117,6 +117,14 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o

 obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o

+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
+
+###
+# 32 bit specific files
+ifeq ($(CONFIG_X86_32),y)
+	obj-$(CONFIG_CHECKPOINT) += checkpoint_32.o
+endif
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/kernel/checkpoint.c
similarity index 77%
rename from arch/x86/mm/checkpoint.c
rename to arch/x86/kernel/checkpoint.c
index 2752fdf..fbe9521 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/kernel/checkpoint.c @@ -18,59 +18,11 @@ #include <linux/checkpoint.h> #include <linux/checkpoint_hdr.h> -/* - * helpers to encode/decode/validate registers/segments/eflags - */ - -static int check_eflags(__u32 eflags) -{ -#define X86_EFLAGS_CKPT_MASK \ - (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \ - X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \ - X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID | X86_EFLAGS_RF) - - if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2)) - return 0; - return 1; -} - -static void restore_eflags(struct pt_regs *regs, __u32 eflags) -{ - /* - * A task may have had X86_EFLAGS_RF set at checkpoint, .e.g: - * 1) It ran in a KVM guest, and the guest was being debugged, - * 2) The kernel was debugged using kgbd, - * 3) From Intel's manual: "When calling an event handler, - * Intel 64 and IA-32 processors establish the value of the - * RF flag in the EFLAGS image pushed on the stack: - * - For any fault-class exception except a debug exception - * generated in response to an instruction breakpoint, the - * value pushed for RF is 1. - * - For any interrupt arriving after any iteration of a - * repeated string instruction but the last iteration, the - * value pushed for RF is 1. - * - For any trap-class exception generated by any iteration - * of a repeated string instruction but the last iteration, - * the value pushed for RF is 1. - * - For other cases, the value pushed for RF is the value - * that was in EFLAG.RF at the time the event handler was - * called. - * [from: http://www.intel.com/Assets/PDF/manual/253668.pdf] - * - * The RF flag may be set in EFLAGS by the hardware, or by - * kvm/kgdb, or even by the user with ptrace or by setting a - * suitable context when returning from a signal handler. 
- * - * Therefore, on restart we (1) prserve X86_EFLAGS_RF from - * checkpoint time, and (2) preserve a X86_EFLAGS_RF of the - * restarting process if it already exists on saved EFLAGS. - * Disable preemption to protect EFLAG test-and-change. - */ - preempt_disable(); - eflags |= (regs->flags & X86_EFLAGS_RF); - regs->flags = eflags; - preempt_enable(); -} +extern int check_segment(__u16 seg); +extern __u16 encode_segment(unsigned short seg); +extern unsigned short decode_segment(__u16 seg); +extern void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t); +extern int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t); static int check_tls(struct desc_struct *desc) { @@ -81,70 +33,6 @@ static int check_tls(struct desc_struct *desc) return 1; } -static int check_segment(__u16 seg) -{ - int ret = 0; - - switch (seg) { - case CKPT_X86_SEG_NULL: - case CKPT_X86_SEG_USER32_CS: - case CKPT_X86_SEG_USER32_DS: - return 1; - } - if (seg & CKPT_X86_SEG_TLS) { - seg &= ~CKPT_X86_SEG_TLS; - if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN) - ret = 1; - } else if (seg & CKPT_X86_SEG_LDT) { - seg &= ~CKPT_X86_SEG_LDT; - if (seg <= 0x1fff) - ret = 1; - } - return ret; -} - -static __u16 encode_segment(unsigned short seg) -{ - if (seg == 0) - return CKPT_X86_SEG_NULL; - BUG_ON((seg & 3) != 3); - - if (seg == __USER_CS) - return CKPT_X86_SEG_USER32_CS; - if (seg == __USER_DS) - return CKPT_X86_SEG_USER32_DS; - - if (seg & 4) - return CKPT_X86_SEG_LDT | (seg >> 3); - - seg >>= 3; - if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX) - return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN); - - printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg); - BUG(); -} - -static unsigned short decode_segment(__u16 seg) -{ - if (seg == CKPT_X86_SEG_NULL) - return 0; - if (seg == CKPT_X86_SEG_USER32_CS) - return __USER_CS; - if (seg == CKPT_X86_SEG_USER32_DS) - return __USER_DS; - - if (seg & CKPT_X86_SEG_TLS) { - seg &= ~CKPT_X86_SEG_TLS; - return ((GDT_ENTRY_TLS_MIN + 
seg) << 3) | 3; - } - if (seg & CKPT_X86_SEG_LDT) { - seg &= ~CKPT_X86_SEG_LDT; - return (seg << 3) | 7; - } - BUG(); -} - #define CKPT_X86_TIF_UNSUPPORTED (_TIF_SECCOMP | _TIF_IO_BITMAP) /************************************************************************** @@ -153,10 +41,12 @@ static unsigned short decode_segment(__u16 seg) static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t) { +#ifdef CONFIG_X86_32 if (t->thread.vm86_info) { ckpt_err(ctx, -EBUSY, "%(T)Task in VM86 mode\n"); return -EBUSY; } +#endif if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) { ckpt_err(ctx, -EBUSY, "%(T)Bad thread info flags %#lx\n", task_thread_info(t)->flags); @@ -195,64 +85,10 @@ int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } -#ifdef CONFIG_X86_32 - -static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) -{ - struct thread_struct *thread = &t->thread; - struct pt_regs *regs = task_pt_regs(t); - unsigned long _gs; - - h->bp = regs->bp; - h->bx = regs->bx; - h->ax = regs->ax; - h->cx = regs->cx; - h->dx = regs->dx; - h->si = regs->si; - h->di = regs->di; - h->orig_ax = regs->orig_ax; - h->ip = regs->ip; - - h->flags = regs->flags; - h->sp = regs->sp; - - h->cs = encode_segment(regs->cs); - h->ss = encode_segment(regs->ss); - h->ds = encode_segment(regs->ds); - h->es = encode_segment(regs->es); - - /* - * for checkpoint in process context (from within a container) - * the GS segment register should be saved from the hardware; - * otherwise it is already saved on the thread structure - */ - if (t == current) - _gs = get_user_gs(regs); - else - _gs = thread->gs; - - h->fsindex = encode_segment(regs->fs); - h->gsindex = encode_segment(_gs); - - /* - * for checkpoint in process context (from within a container), - * the actual syscall is taking place at this very moment; so - * we (optimistically) subtitute the future return value (0) of - * this syscall into the orig_eax, so that upon restart it will - 
* succeed (or it will endlessly retry checkpoint...) - */ - if (t == current) { - BUG_ON(h->orig_ax < 0); - h->ax = 0; - } -} - static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t) { struct thread_struct *thread = &t->thread; - /* debug regs */ - /* * for checkpoint in process context (from within a container), * get the actual registers; otherwise get the saved values. @@ -315,8 +151,6 @@ static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } -#endif /* CONFIG_X86_32 */ - /* dump the cpu state and registers of a given task */ int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t) { @@ -438,6 +272,13 @@ int restore_thread(struct ckpt_ctx *ctx) load_TLS(thread, cpu); put_cpu(); +#if defined(CONFIG_X86_64) && defined(CONFIG_COMPAT) + if (h->thread_info_flags & _TIF_IA32) + set_thread_flag(TIF_IA32); + else + clear_thread_flag(TIF_IA32); +#endif + /* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */ ret = 0; @@ -446,49 +287,6 @@ int restore_thread(struct ckpt_ctx *ctx) return ret; } -#ifdef CONFIG_X86_32 - -static int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) -{ - struct thread_struct *thread = &t->thread; - struct pt_regs *regs = task_pt_regs(t); - - if (!check_eflags(h->flags)) - return -EINVAL; - if (h->cs == CKPT_X86_SEG_NULL) - return -EINVAL; - if (!check_segment(h->cs) || !check_segment(h->ds) || - !check_segment(h->es) || !check_segment(h->ss) || - !check_segment(h->fsindex) || !check_segment(h->gsindex)) - return -EINVAL; - - regs->bp = h->bp; - regs->bx = h->bx; - regs->ax = h->ax; - regs->cx = h->cx; - regs->dx = h->dx; - regs->si = h->si; - regs->di = h->di; - regs->orig_ax = h->orig_ax; - regs->ip = h->ip; - - restore_eflags(regs, h->flags); - regs->sp = h->sp; - - regs->ds = decode_segment(h->ds); - regs->es = decode_segment(h->es); - regs->cs = decode_segment(h->cs); - regs->ss = decode_segment(h->ss); - - regs->fs = decode_segment(h->fsindex); - regs->gs = 
decode_segment(h->gsindex); - - thread->gs = regs->gs; - lazy_load_gs(regs->gs); - - return 0; -} - static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t) { int ret; @@ -548,7 +346,65 @@ static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } -#endif /* CONFIG_X86_32 */ +static int check_eflags(__u32 eflags) +{ +#define X86_EFLAGS_CKPT_MASK \ + (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \ + X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \ + X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID | X86_EFLAGS_RF) + + if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2)) + return 0; + return 1; +} + +static void restore_eflags(struct pt_regs *regs, __u32 eflags) +{ + /* + * A task may have had X86_EFLAGS_RF set at checkpoint, .e.g: + * 1) It ran in a KVM guest, and the guest was being debugged, + * 2) The kernel was debugged using kgbd, + * 3) From Intel's manual: "When calling an event handler, + * Intel 64 and IA-32 processors establish the value of the + * RF flag in the EFLAGS image pushed on the stack: + * - For any fault-class exception except a debug exception + * generated in response to an instruction breakpoint, the + * value pushed for RF is 1. + * - For any interrupt arriving after any iteration of a + * repeated string instruction but the last iteration, the + * value pushed for RF is 1. + * - For any trap-class exception generated by any iteration + * of a repeated string instruction but the last iteration, + * the value pushed for RF is 1. + * - For other cases, the value pushed for RF is the value + * that was in EFLAG.RF at the time the event handler was + * called. + * [from: http://www.intel.com/Assets/PDF/manual/253668.pdf] + * + * The RF flag may be set in EFLAGS by the hardware, or by + * kvm/kgdb, or even by the user with ptrace or by setting a + * suitable context when returning from a signal handler. 
+ * + * Therefore, on restart we (1) prserve X86_EFLAGS_RF from + * checkpoint time, and (2) preserve a X86_EFLAGS_RF of the + * restarting process if it already exists on saved EFLAGS. + * Disable preemption to protect EFLAG test-and-change. + */ + preempt_disable(); + eflags |= (regs->flags & X86_EFLAGS_RF); + regs->flags = eflags; + preempt_enable(); +} + +static int load_cpu_eflags(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct pt_regs *regs = task_pt_regs(t); + + if (!check_eflags(h->flags)) + return -EINVAL; + restore_eflags(regs, h->flags); + return 0; +} /* read the cpu state and registers for the current task */ int restore_cpu(struct ckpt_ctx *ctx) @@ -566,6 +422,9 @@ int restore_cpu(struct ckpt_ctx *ctx) ret = load_cpu_regs(h, t); if (ret < 0) goto out; + ret = load_cpu_eflags(h, t); + if (ret < 0) + goto out; ret = load_cpu_debug(h, t); if (ret < 0) goto out; diff --git a/arch/x86/kernel/checkpoint_32.c b/arch/x86/kernel/checkpoint_32.c new file mode 100644 index 0000000..d5ea6a0 --- /dev/null +++ b/arch/x86/kernel/checkpoint_32.c @@ -0,0 +1,191 @@ +/* + * Checkpoint/restart - architecture specific support for x86_32 + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DSYS + +#include <asm/desc.h> +#include <asm/i387.h> +#include <asm/elf.h> + +#include <linux/checkpoint.h> +#include <linux/checkpoint_hdr.h> + +/* + * sys_restart needs to access and modify the pt_regs structure to + * restore the original state from the time of the checkpoint. 
+ */ +long sys_restart(struct pt_regs *regs) +{ + unsigned long flags; + int fd, logfd; + pid_t pid; + + pid = regs->bx; + fd = regs->cx; + flags = regs->dx; + logfd = regs->di; + + return do_sys_restart(pid, fd, flags, logfd); +} + +/* helpers to encode/decode/validate segments */ + +static int check_segment(__u16 seg) +{ + int ret = 0; + + switch (seg) { + case CKPT_X86_SEG_NULL: + case CKPT_X86_SEG_USER32_CS: + case CKPT_X86_SEG_USER32_DS: + return 1; + } + if (seg & CKPT_X86_SEG_TLS) { + seg &= ~CKPT_X86_SEG_TLS; + if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN) + ret = 1; + } else if (seg & CKPT_X86_SEG_LDT) { + seg &= ~CKPT_X86_SEG_LDT; + if (seg <= 0x1fff) + ret = 1; + } + return ret; +} + +static __u16 encode_segment(unsigned short seg) +{ + if (seg == 0) + return CKPT_X86_SEG_NULL; + BUG_ON((seg & 3) != 3); + + if (seg == __USER_CS) + return CKPT_X86_SEG_USER32_CS; + if (seg == __USER_DS) + return CKPT_X86_SEG_USER32_DS; + + if (seg & 4) + return CKPT_X86_SEG_LDT | (seg >> 3); + + seg >>= 3; + if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX) + return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN); + + printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg); + BUG(); +} + +static unsigned short decode_segment(__u16 seg) +{ + if (seg == CKPT_X86_SEG_NULL) + return 0; + if (seg == CKPT_X86_SEG_USER32_CS) + return __USER_CS; + if (seg == CKPT_X86_SEG_USER32_DS) + return __USER_DS; + + if (seg & CKPT_X86_SEG_TLS) { + seg &= ~CKPT_X86_SEG_TLS; + return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3; + } + if (seg & CKPT_X86_SEG_LDT) { + seg &= ~CKPT_X86_SEG_LDT; + return (seg << 3) | 7; + } + BUG(); +} + +void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct thread_struct *thread = &t->thread; + struct pt_regs *regs = task_pt_regs(t); + unsigned long _gs; + + h->bp = regs->bp; + h->bx = regs->bx; + h->ax = regs->ax; + h->cx = regs->cx; + h->dx = regs->dx; + h->si = regs->si; + h->di = regs->di; + h->orig_ax = regs->orig_ax; + h->ip = 
regs->ip; + + h->flags = regs->flags; + h->sp = regs->sp; + + h->cs = encode_segment(regs->cs); + h->ss = encode_segment(regs->ss); + h->ds = encode_segment(regs->ds); + h->es = encode_segment(regs->es); + + /* + * for checkpoint in process context (from within a container) + * the GS segment register should be saved from the hardware; + * otherwise it is already saved on the thread structure + */ + if (t == current) + _gs = get_user_gs(regs); + else + _gs = thread->gs; + + h->fsindex = encode_segment(regs->fs); + h->gsindex = encode_segment(_gs); + + /* + * for checkpoint in process context (from within a container), + * the actual syscall is taking place at this very moment; so + * we (optimistically) subtitute the future return value (0) of + * this syscall into the orig_eax, so that upon restart it will + * succeed (or it will endlessly retry checkpoint...) + */ + if (t == current) { + BUG_ON(h->orig_ax < 0); + h->ax = 0; + } +} + +int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t) +{ + struct thread_struct *thread = &t->thread; + struct pt_regs *regs = task_pt_regs(t); + + if (h->cs == CKPT_X86_SEG_NULL) + return -EINVAL; + if (!check_segment(h->cs) || !check_segment(h->ds) || + !check_segment(h->es) || !check_segment(h->ss) || + !check_segment(h->fsindex) || !check_segment(h->gsindex)) + return -EINVAL; + + regs->bp = h->bp; + regs->bx = h->bx; + regs->ax = h->ax; + regs->cx = h->cx; + regs->dx = h->dx; + regs->si = h->si; + regs->di = h->di; + regs->orig_ax = h->orig_ax; + regs->ip = h->ip; + + regs->sp = h->sp; + + regs->ds = decode_segment(h->ds); + regs->es = decode_segment(h->es); + regs->cs = decode_segment(h->cs); + regs->ss = decode_segment(h->ss); + + regs->fs = decode_segment(h->fsindex); + regs->gs = decode_segment(h->gsindex); + + thread->gs = regs->gs; + lazy_load_gs(regs->gs); + + return 0; +} diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 7e7f3c8..ecefd09 100644 --- a/arch/x86/kernel/entry_32.S +++ 
b/arch/x86/kernel/entry_32.S @@ -726,6 +726,9 @@ PTREGSCALL(sigreturn) PTREGSCALL(rt_sigreturn) PTREGSCALL(vm86) PTREGSCALL(vm86old) +#ifdef CONFIG_CHECKPOINT +PTREGSCALL(restart) +#endif .macro FIXUP_ESPFIX_STACK /* diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index a1bc7f7..1ca053e 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -338,4 +338,4 @@ ENTRY(sys_call_table) .long sys_perf_event_open .long ptregs_eclone .long sys_checkpoint - .long sys_restart + .long ptregs_restart diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 735c0b2..06630d2 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -26,5 +26,3 @@ obj-$(CONFIG_K8_NUMA) += k8topology_64.o obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o obj-$(CONFIG_MEMTEST) += memtest.o - -obj-$(CONFIG_CHECKPOINT) += checkpoint.o diff --git a/checkpoint/sys.c b/checkpoint/sys.c index afcfa1e..89056d6 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -648,7 +648,7 @@ SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, } /** - * sys_restart - restart a container + * do_sys_restart - restart a container * @pid: pid of task root (in coordinator's namespace), or 0 * @fd: file from which read the checkpoint image * @flags: restart operation flags @@ -657,8 +657,7 @@ SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, * Returns negative value on error, or otherwise returns in the realm * of the original checkpoint */ -SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, - unsigned long, flags, int, logfd) +long do_sys_restart(pid_t pid, int fd, unsigned long flags, int logfd) { struct ckpt_ctx *ctx = NULL; long ret; diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index c6c8d56..d81c59c 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -60,6 +60,8 @@ #define CKPT_LSM_INFO_LEN 200 #define CKPT_LSM_STRING_MAX 1024 +extern long do_sys_restart(pid_t pid, int fd, unsigned long flags, int 
logfd); + extern int walk_task_subtree(struct task_struct *task, int (*func)(struct task_struct *, void *), void *data); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 9ed192f..264a02e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -874,8 +874,6 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int, size_t); asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags, int logfd); -asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags, - int logfd); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 9+ messages in thread
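The check_eflags() helper that this patch moves accepts a saved EFLAGS image only if every bit outside the user-changeable mask matches the expected fixed state: interrupts enabled (IF) plus the always-set reserved bit 1. A minimal userspace sketch of that test follows. The `SK_*` names are stand-ins for the kernel's `X86_EFLAGS_*` constants, with values taken from the architectural EFLAGS layout; it is an illustration, not the patch's code.

```c
#include <stdint.h>

/* Architectural EFLAGS bit positions (stand-ins for X86_EFLAGS_*). */
#define SK_CF	0x00000001u
#define SK_FIXED 0x00000002u	/* reserved bit 1, always reads as 1 */
#define SK_PF	0x00000004u
#define SK_AF	0x00000010u
#define SK_ZF	0x00000040u
#define SK_SF	0x00000080u
#define SK_TF	0x00000100u
#define SK_IF	0x00000200u
#define SK_DF	0x00000400u
#define SK_OF	0x00000800u
#define SK_NT	0x00004000u
#define SK_RF	0x00010000u
#define SK_AC	0x00040000u
#define SK_ID	0x00200000u

/* Bits a user task may legitimately have in any state at checkpoint. */
#define SK_USER_MASK \
	(SK_CF | SK_PF | SK_AF | SK_ZF | SK_SF | SK_TF | SK_DF | SK_OF | \
	 SK_NT | SK_AC | SK_ID | SK_RF)

/* Return 1 if the image is acceptable, 0 otherwise. */
static int sk_check_eflags(uint32_t eflags)
{
	/* Everything outside the mask must be exactly IF plus bit 1. */
	return (eflags & ~SK_USER_MASK) == (SK_IF | SK_FIXED);
}
```

With this test, an image with IOPL bits set or with IF cleared is rejected, so a crafted checkpoint image cannot hand the restarted task privileged flag state.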
* [PATCH] user-cr: eclone x86-64 wrapper
@ 2009-12-06 20:31 Oren Laadan
From: Oren Laadan @ 2009-12-06 20:31 UTC
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Alexey Dobriyan, Louis Rilling, Dave Hansen

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 clone_x86_64.c |   88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 88 insertions(+), 0 deletions(-)
 create mode 100644 clone_x86_64.c

diff --git a/clone_x86_64.c b/clone_x86_64.c
new file mode 100644
index 0000000..d6d7e6f
--- /dev/null
+++ b/clone_x86_64.c
@@ -0,0 +1,88 @@
+/*
+ *  clone_x86_64.c: support for eclone() on x86_64
+ *
+ *  Copyright (C) Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+ *  Copyright (C) Dave Hansen <daveh-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define _GNU_SOURCE
+
+#include <unistd.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <asm/unistd.h>
+
+/*
+ * libc doesn't support eclone() yet...
+ * below is arch-dependent code to use the syscall
+ */
+#include <linux/checkpoint.h>
+
+#include "eclone.h"
+
+#ifndef __NR_eclone
+#define __NR_eclone 299
+#endif
+
+int eclone(int (*fn)(void *), void *fn_arg, int clone_flags_low,
+	   struct clone_args *clone_args, pid_t *pids)
+{
+	struct clone_args my_args;
+	long retval;
+	void **newstack;
+
+	if (clone_args->child_stack) {
+		/*
+		 * Set up the stack for child:
+		 *  - fn_arg will be the argument for the child function
+		 *  - the fn pointer will be loaded into rax after the clone
+		 */
+		newstack = (void **)(unsigned long)(clone_args->child_stack +
+					clone_args->child_stack_size);
+		*--newstack = fn_arg;
+		*--newstack = fn;
+	} else
+		newstack = (void **)0;
+
+	my_args = *clone_args;
+	my_args.child_stack = (unsigned long)newstack;
+	my_args.child_stack_size = 0;
+
+	__asm__ __volatile__(
+		"movq %6, %%r10\n\t"	/* pids in r10*/
+		"syscall\n\t"		/* Linux/x86_64 system call */
+		"testq %0,%0\n\t"	/* check return value */
+		"jne 1f\n\t"		/* jump if parent */
+		"popq %%rax\n\t"	/* get subthread function */
+		"popq %%rdi\n\t"	/* get the subthread function arg */
+		"call *%%rax\n\t"	/* start subthread function */
+		"movq %2,%0\n\t"
+		"syscall\n"		/* exit system call: exit subthread */
+		"1:\n\t"
+		:"=a" (retval)
+		:"0" (__NR_eclone), "i" (__NR_exit),
+		 "D" (clone_flags_low),	/* rdi */
+		 "S" (&my_args),	/* rsi */
+		 "d" (sizeof(my_args)),	/* rdx */
+		 "m" (pids)		/* gets moved to r10 */
+		:"rcx", "r10", "r11", "cc"
+	);
+	/*
+	 * glibc lists 'cc' as clobbered, so we might as
+	 * well do it too. 'r11' and 'rcx' are clobbered
+	 * by the 'syscall' instruction itself. 'r8' and
+	 * 'r9' are clobbered by the clone, but that
+	 * thread will exit before getting back out to C.
+	 */
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
-- 
1.6.3.3
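The wrapper above hands fn and fn_arg to the child through the new stack: the parent stores them just below the top of the stack, and the child pops them in reverse order before making the call. Below is a rough userspace sketch of only that hand-off, with an ordinary array standing in for the child stack and plain C standing in for the popq/popq/call sequence. All `demo_*` names are hypothetical and nothing here invokes eclone().

```c
#include <stddef.h>
#include <stdint.h>

/* Example child function: returns its integer argument plus one. */
static int demo_child(void *arg)
{
	return *(int *)arg + 1;
}

/*
 * Parent side: store fn_arg, then fn, below the top of the stack, so
 * that fn ends up at the lowest address (the child's initial top).
 */
static uintptr_t *demo_prepare_stack(uintptr_t *stack_base, size_t nslots,
				     int (*fn)(void *), void *fn_arg)
{
	uintptr_t *sp = stack_base + nslots;	/* stacks grow downwards */

	*--sp = (uintptr_t)fn_arg;
	*--sp = (uintptr_t)fn;
	return sp;
}

/* Child side: pop fn first, then its argument, and make the call. */
static int demo_run_child(uintptr_t *sp)
{
	int (*fn)(void *) = (int (*)(void *))*sp++;
	void *arg = (void *)*sp++;

	return fn(arg);
}
```

Storing fn last matters because the child pops it first; swapping the two stores would make the child try to call through the argument pointer.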
* [PATCH 2/2] c/r: x86-64: checkpoint/restart implementation
@ 2009-12-06 20:31 Oren Laadan
From: Oren Laadan @ 2009-12-06 20:31 UTC
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Alexey Dobriyan, Louis Rilling, Dave Hansen

Support for checkpoint and restart for the X86_64 architecture.
Partly based on Alexey's work.

 Checkpoint        Restart
 (app/arch)        (app/arch)
 --------------------------------
 64/x86-64    ->   64/x86-64    works
 32/x86-64    ->   32/x86-64    ?
 32/x86-64    ->   32/x86-32    ?
 32/x86-32    ->   32/x86-64    ?

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/Kconfig                      |    2 +-
 arch/x86/include/asm/checkpoint_hdr.h |    6 +
 arch/x86/include/asm/syscalls.h       |    6 +
 arch/x86/include/asm/unistd_64.h      |    4 +
 arch/x86/kernel/Makefile              |    2 +
 arch/x86/kernel/checkpoint_64.c       |  251 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/entry_64.S            |    5 +
 include/linux/checkpoint_hdr.h        |    2 +
 8 files changed, 277 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/kernel/checkpoint_64.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 69d6077..f6260f5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -88,7 +88,7 @@ config HAVE_LATENCYTOP_SUPPORT

 config CHECKPOINT_SUPPORT
 	bool
-	default y if X86_32
+	default y

 config MMU
 	def_bool y
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 65511ca..0033bfe 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -36,6 +36,10 @@
 #include <asm/processor.h>
 #endif

+#ifdef CONFIG_X86_64
+#define CKPT_ARCH_ID CKPT_ARCH_X86_64
+#endif
+
 #ifdef CONFIG_X86_32
 #define CKPT_ARCH_ID CKPT_ARCH_X86_32
 #endif
@@ -135,6 +139,8 @@ struct ckpt_hdr_cpu {
 #define CKPT_X86_SEG_NULL 0
 #define CKPT_X86_SEG_USER32_CS 1
#define CKPT_X86_SEG_USER32_DS 2 +#define CKPT_X86_SEG_USER64_CS 3 +#define CKPT_X86_SEG_USER64_DS 4 #define CKPT_X86_SEG_TLS 0x4000 /* 0100 0000 0000 00xx */ #define CKPT_X86_SEG_LDT 0x8000 /* 100x xxxx xxxx xxxx */ diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 1079447..063cdd0 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -88,6 +88,12 @@ asmlinkage long sys_execve(char __user *, char __user * __user *, struct pt_regs *); long sys_arch_prctl(int, unsigned long); +/* kernel/checkpoint_64.c */ +#ifdef CONFIG_CHECKPOINT +asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd, + struct pt_regs *regs); +#endif + /* kernel/signal.c */ asmlinkage long sys_sigaltstack(const stack_t __user *, stack_t __user *, struct pt_regs *); diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h index d2ffc89..c360707 100644 --- a/arch/x86/include/asm/unistd_64.h +++ b/arch/x86/include/asm/unistd_64.h @@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo) __SYSCALL(__NR_perf_event_open, sys_perf_event_open) #define __NR_eclone 299 __SYSCALL(__NR_eclone, stub_eclone) +#define __NR_checkpoint 300 +__SYSCALL(__NR_checkpoint, sys_checkpoint) +#define __NR_restart 301 +__SYSCALL(__NR_restart, stub_restart) #ifndef __NO_STUBS diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 2821fd6..ded0ee2 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -138,4 +138,6 @@ ifeq ($(CONFIG_X86_64),y) obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o obj-y += vsmp_64.o + + obj-$(CONFIG_CHECKPOINT) += checkpoint_64.o endif diff --git a/arch/x86/kernel/checkpoint_64.c b/arch/x86/kernel/checkpoint_64.c new file mode 100644 index 0000000..3901a53 --- /dev/null +++ b/arch/x86/kernel/checkpoint_64.c @@ -0,0 +1,251 @@ +/* + * Checkpoint/restart - architecture specific support for x86_64 + * + * Copyright (C) 2009 Oren Laadan + 
 *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+#include <asm/elf.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * sys_restart needs to access and modify the pt_regs structure to
+ * restore the original state from the time of the checkpoint.
+ */
+asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd,
+			    struct pt_regs *regs)
+{
+	return do_sys_restart(pid, fd, flags, logfd);
+}
+
+/* helpers to encode/decode/validate segments */
+
+int check_segment(__u16 seg)
+{
+	int ret = 0;
+
+	switch (seg) {
+	case CKPT_X86_SEG_NULL:
+	case CKPT_X86_SEG_USER64_CS:
+	case CKPT_X86_SEG_USER64_DS:
+#ifdef CONFIG_COMPAT
+	case CKPT_X86_SEG_USER32_CS:
+	case CKPT_X86_SEG_USER32_DS:
+#endif
+		return 1;
+	}
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+			ret = 1;
+	} else if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		if (seg <= 0x1fff)
+			ret = 1;
+	}
+	return ret;
+}
+
+__u16 encode_segment(unsigned short seg)
+{
+	if (seg == 0)
+		return CKPT_X86_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+
+	if (seg == __USER_CS)
+		return CKPT_X86_SEG_USER64_CS;
+	if (seg == __USER_DS)
+		return CKPT_X86_SEG_USER64_DS;
+#ifdef CONFIG_COMPAT
+	if (seg == __USER32_CS)
+		return CKPT_X86_SEG_USER32_CS;
+	if (seg == __USER32_DS)
+		return CKPT_X86_SEG_USER32_DS;
+#endif
+
+	if (seg & 4)
+		return CKPT_X86_SEG_LDT | (seg >> 3);
+
+	seg >>= 3;
+	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+	printk(KERN_ERR "c/r: (encode) bad segment %#hx\n", seg);
+	BUG();
+}
+
+unsigned short decode_segment(__u16 seg)
+{
+	if (seg == CKPT_X86_SEG_NULL)
+		return 0;
+
+	if (seg == CKPT_X86_SEG_USER64_CS)
+		return __USER_CS;
+	if (seg == CKPT_X86_SEG_USER64_DS)
+		return __USER_DS;
+#ifdef CONFIG_COMPAT
+	if (seg == CKPT_X86_SEG_USER32_CS)
+		return __USER32_CS;
+	if (seg == CKPT_X86_SEG_USER32_DS)
+		return __USER32_DS;
+#endif
+
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	unsigned long _ds, _es, _fs, _gs;
+
+	h->r15 = regs->r15;
+	h->r14 = regs->r14;
+	h->r13 = regs->r13;
+	h->r12 = regs->r12;
+	h->r11 = regs->r11;
+	h->r10 = regs->r10;
+	h->r9 = regs->r9;
+	h->r8 = regs->r8;
+
+	h->bp = regs->bp;
+	h->bx = regs->bx;
+	h->ax = regs->ax;
+	h->cx = regs->cx;
+	h->dx = regs->dx;
+	h->si = regs->si;
+	h->di = regs->di;
+	h->orig_ax = regs->orig_ax;
+	h->ip = regs->ip;
+
+	h->flags = regs->flags;
+	h->sp = regs->sp;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * DS, ES, FS, GS registers should be saved from the hardware;
+	 * otherwise they are already saved on the thread structure
+	 */
+
+	h->cs = encode_segment(regs->cs);
+	h->ss = encode_segment(regs->ss);
+
+	if (t == current) {
+		savesegment(ds, _ds);
+		savesegment(es, _es);
+		savesegment(fs, _fs);
+		savesegment(gs, _gs);
+	} else {
+		_ds = t->thread.ds;
+		_es = t->thread.es;
+		_fs = t->thread.fsindex;
+		_gs = t->thread.gsindex;
+	}
+	h->ds = encode_segment(_ds);
+	h->es = encode_segment(_es);
+	h->fsindex = encode_segment(_fs);
+	h->gsindex = encode_segment(_gs);
+
+	if (!test_tsk_thread_flag(t, TIF_IA32)) {
+		h->fs = t->thread.fs;
+		h->gs = t->thread.gs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON(h->orig_ax < 0);
+		h->ax = 0;
+	}
+}
+
+int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	if (h->cs == CKPT_X86_SEG_NULL)
+		return -EINVAL;
+	if (!check_segment(h->cs) || !check_segment(h->ds) ||
+	    !check_segment(h->es) || !check_segment(h->ss) ||
+	    !check_segment(h->fsindex) || !check_segment(h->gsindex))
+		return -EINVAL;
+
+#ifdef CONFIG_COMPAT
+	if (test_tsk_thread_flag(t, TIF_IA32) &&
+	    (!check_segment(h->fs) || !check_segment(h->gs)))
+		return -EINVAL;
+#endif
+
+	regs->r15 = h->r15;
+	regs->r14 = h->r14;
+	regs->r13 = h->r13;
+	regs->r12 = h->r12;
+	regs->r11 = h->r11;
+	regs->r10 = h->r10;
+	regs->r9 = h->r9;
+	regs->r8 = h->r8;
+
+	regs->bp = h->bp;
+	regs->bx = h->bx;
+	regs->ax = h->ax;
+	regs->cx = h->cx;
+	regs->dx = h->dx;
+	regs->si = h->si;
+	regs->di = h->di;
+	regs->orig_ax = h->orig_ax;
+	regs->ip = h->ip;
+
+	regs->sp = h->sp;
+	thread->usersp = h->sp;
+
+	preempt_disable();
+
+	regs->cs = decode_segment(h->cs);
+	regs->ss = decode_segment(h->ss);
+	thread->ds = decode_segment(h->ds);
+	thread->es = decode_segment(h->es);
+	thread->fsindex = decode_segment(h->fsindex);
+	thread->gsindex = decode_segment(h->gsindex);
+
+#ifdef CONFIG_COMPAT
+	if (!test_tsk_thread_flag(t, TIF_IA32)) {
+		thread->fs = h->fs;
+		thread->gs = h->gs;
+	}
+#endif
+
+	/* XXX - unsure if this is really needed ... */
+	loadsegment(fs, thread->fsindex);
+	if (thread->fs)
+		wrmsrl(MSR_FS_BASE, thread->fs);
+	load_gs_index(thread->gsindex);
+	if (thread->gs)
+		wrmsrl(MSR_KERNEL_GS_BASE, thread->gs);
+
+	preempt_enable();
+
+	return 0;
+}
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 6d60cd1..e692193 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -699,6 +699,11 @@ END(\label)
 	PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
 	PTREGSCALL stub_iopl, sys_iopl, %rsi
 	PTREGSCALL stub_eclone, sys_eclone, %r8
+#ifdef CONFIG_CHECKPOINT
+	PTREGSCALL stub_restart, sys_restart, %r8
+#else
+	PTREGSCALL stub_restart, sys_ni_syscall, %r8
+#endif
 
 ENTRY(ptregscall_common)
 	DEFAULT_FRAME 1 8	/* offset 8: return address */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4e57d37..6468fa9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -195,6 +195,8 @@ enum {
 #define CKPT_ARCH_PPC32 CKPT_ARCH_PPC32
 	CKPT_ARCH_PPC64,
 #define CKPT_ARCH_PPC64 CKPT_ARCH_PPC64
+	CKPT_ARCH_X86_64,
+#define CKPT_ARCH_X86_64 CKPT_ARCH_X86_64
 };
 
 /* shared objrects (objref) */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread
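For readers following the encode/decode helpers in the patch above, here is a minimal userspace sketch of the same round trip. It is not the kernel code: the SK_* encoding values are made up for illustration (the real CKPT_X86_SEG_* constants are not shown in this thread), the selector values 0x33/0x2b and the TLS range 12..14 are assumed to match the x86-64 GDT layout of that era, and the sketch returns a hypothetical SK_SEG_INVALID where the patch calls BUG().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical image encodings; the real CKPT_X86_SEG_* values may differ. */
#define SK_SEG_NULL     0x0000
#define SK_SEG_USER_CS  0x0001
#define SK_SEG_USER_DS  0x0002
#define SK_SEG_TLS      0x4000  /* flag: low bits hold the TLS slot number */
#define SK_SEG_LDT      0x8000  /* flag: low bits hold the LDT index */
#define SK_SEG_INVALID  0xffff  /* sketch only: the patch calls BUG() instead */

/* Assumed x86-64 values, stated here only for the sake of the example. */
#define SK_USER_CS  0x33
#define SK_USER_DS  0x2b
#define SK_TLS_MIN  12
#define SK_TLS_MAX  14

/* Map a hardware selector to a layout-independent image value. */
uint16_t sk_encode_segment(uint16_t sel)
{
	uint16_t idx;

	if (sel == 0)
		return SK_SEG_NULL;
	if ((sel & 3) != 3)		/* only ring-3 selectors are expected */
		return SK_SEG_INVALID;
	if (sel == SK_USER_CS)
		return SK_SEG_USER_CS;
	if (sel == SK_USER_DS)
		return SK_SEG_USER_DS;
	if (sel & 4)			/* table-indicator bit set: LDT entry */
		return (uint16_t)(SK_SEG_LDT | (sel >> 3));

	idx = sel >> 3;
	if (idx >= SK_TLS_MIN && idx <= SK_TLS_MAX)
		return (uint16_t)(SK_SEG_TLS | (idx - SK_TLS_MIN));
	return SK_SEG_INVALID;
}

/* Validate an image value before decoding, mirroring check_segment(). */
int sk_check_segment(uint16_t enc)
{
	if (enc == SK_SEG_NULL || enc == SK_SEG_USER_CS || enc == SK_SEG_USER_DS)
		return 1;
	if (enc & SK_SEG_TLS)
		return (enc & ~SK_SEG_TLS) <= SK_TLS_MAX - SK_TLS_MIN;
	if (enc & SK_SEG_LDT)
		return (enc & ~SK_SEG_LDT) <= 0x1fff;
	return 0;
}

/* Map a validated image value back to a hardware selector. */
uint16_t sk_decode_segment(uint16_t enc)
{
	if (enc == SK_SEG_NULL)
		return 0;
	if (enc == SK_SEG_USER_CS)
		return SK_USER_CS;
	if (enc == SK_SEG_USER_DS)
		return SK_USER_DS;
	if (enc & SK_SEG_TLS)
		return (uint16_t)(((SK_TLS_MIN + (enc & ~SK_SEG_TLS)) << 3) | 3);
	if (enc & SK_SEG_LDT)
		return (uint16_t)(((enc & ~SK_SEG_LDT) << 3) | 7);
	return 0;
}
```

The point of the indirection is that the checkpoint image stores a TLS slot number or LDT index rather than a raw GDT selector, so the image does not depend on the GDT layout of the kernel that wrote it. Validating before decoding is what lets the restart path reject a bad image with -EINVAL instead of hitting BUG().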
* Re: [PATCH] user-cr: eclone x86-64 wrapper
[not found] ` <1260131469-2917-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-12-06 20:31 ` [PATCH 2/2] c/r: x86-64: checkpoint/restart implementation Oren Laadan
@ 2009-12-06 20:35 ` Oren Laadan
1 sibling, 0 replies; 9+ messages in thread
From: Oren Laadan @ 2009-12-06 20:35 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Louis Rilling, Alexey Dobriyan, Dave Hansen

To test this, you need to update the kernel headers for user-cr

  $ scripts/extract_headers -s PATH_TO_CR_KERNEL

Oren.

Oren Laadan wrote:
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> ---
>  clone_x86_64.c |   88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 88 insertions(+), 0 deletions(-)
>  create mode 100644 clone_x86_64.c
> 
> diff --git a/clone_x86_64.c b/clone_x86_64.c
> new file mode 100644
> index 0000000..d6d7e6f
> --- /dev/null
> +++ b/clone_x86_64.c
> @@ -0,0 +1,88 @@
> +/*
> + *  clone_x86_64.c: support for eclone() on x86_64
> + *
> + *  Copyright (C) Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> + *  Copyright (C) Dave Hansen <daveh-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#define _GNU_SOURCE
> +
> +#include <unistd.h>
> +#include <errno.h>
> +#include <sys/types.h>
> +#include <sys/syscall.h>
> +#include <asm/unistd.h>
> +
> +/*
> + * libc doesn't support eclone() yet...
> + * below is arch-dependent code to use the syscall
> + */
> +#include <linux/checkpoint.h>
> +
> +#include "eclone.h"
> +
> +#ifndef __NR_eclone
> +#define __NR_eclone 299
> +#endif
> +
> +int eclone(int (*fn)(void *), void *fn_arg, int clone_flags_low,
> +	   struct clone_args *clone_args, pid_t *pids)
> +{
> +	struct clone_args my_args;
> +	long retval;
> +	void **newstack;
> +
> +	if (clone_args->child_stack) {
> +		/*
> +		 * Set up the stack for child:
> +		 *  - fn_arg will be the argument for the child function
> +		 *  - the fn pointer will be popped into rax after the clone
> +		 */
> +		newstack = (void **)(unsigned long)(clone_args->child_stack +
> +						    clone_args->child_stack_size);
> +		*--newstack = fn_arg;
> +		*--newstack = fn;
> +	} else
> +		newstack = (void **)0;
> +
> +	my_args = *clone_args;
> +	my_args.child_stack = (unsigned long)newstack;
> +	my_args.child_stack_size = 0;
> +
> +	__asm__ __volatile__(
> +		"movq %6, %%r10\n\t"	/* pids in r10 */
> +		"syscall\n\t"		/* Linux/x86_64 system call */
> +		"testq %0,%0\n\t"	/* check return value */
> +		"jne 1f\n\t"		/* jump if parent */
> +		"popq %%rax\n\t"	/* get subthread function */
> +		"popq %%rdi\n\t"	/* get the subthread function arg */
> +		"call *%%rax\n\t"	/* start subthread function */
> +		"movq %2,%0\n\t"
> +		"syscall\n"		/* exit system call: exit subthread */
> +		"1:\n\t"
> +		:"=a" (retval)
> +		:"0" (__NR_eclone), "i" (__NR_exit),
> +		 "D" (clone_flags_low),	/* rdi */
> +		 "S" (&my_args),	/* rsi */
> +		 "d" (sizeof(my_args)),	/* rdx */
> +		 "m" (pids)		/* gets moved to r10 */
> +		:"rcx", "r10", "r11", "cc"
> +	);
> +	/*
> +	 * glibc lists 'cc' as clobbered, so we might as
> +	 * well do it too.  'r11' and 'rcx' are clobbered
> +	 * by the 'syscall' instruction itself.  'r8' and
> +	 * 'r9' are clobbered by the clone, but that
> +	 * thread will exit before getting back out to C.
> +	 */
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}

^ permalink raw reply	[flat|nested] 9+ messages in thread
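The child-stack handshake in the wrapper above (push fn_arg, then fn; the child pops fn, then its argument) can be modelled in plain C without any syscall. This is only a sketch under stated assumptions: all sk_* names are hypothetical, the "stack" is an ordinary array, and uintptr_t slots are used instead of the patch's void ** so that storing a function pointer stays a pointer-to-integer conversion.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*sk_child_fn)(void *);

/*
 * Parent side: start at the top of the stack region and push the
 * argument first, then the function pointer, so that the function
 * pointer ends up at the lowest address (the child's initial sp).
 */
uintptr_t *sk_prepare_child_stack(void *stack_base, size_t stack_size,
				  sk_child_fn fn, void *fn_arg)
{
	uintptr_t *sp = (uintptr_t *)((char *)stack_base + stack_size);

	*--sp = (uintptr_t)fn_arg;
	*--sp = (uintptr_t)fn;
	return sp;
}

/*
 * Child side: models the two popq instructions followed by the
 * indirect call in the inline assembly.
 */
int sk_run_from_stack(uintptr_t *sp)
{
	sk_child_fn fn = (sk_child_fn)*sp++;	/* popq %rax */
	void *arg = (void *)*sp++;		/* popq %rdi */

	return fn(arg);				/* call *%rax */
}

/* Trivial child function used by the demo below. */
int sk_return_arg(void *arg)
{
	return *(int *)arg;
}

/* Drive both halves on a local array standing in for the child stack. */
int sk_demo(void)
{
	uintptr_t stack[64];
	int value = 42;
	uintptr_t *sp = sk_prepare_child_stack(stack, sizeof(stack),
					       sk_return_arg, &value);

	return sk_run_from_stack(sp);
}
```

In the real wrapper the child cannot use C locals for this hand-off because it starts life on a fresh stack right after the syscall returns, which is presumably why the two values travel on the new stack itself.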
* Re: [PATCH 1/2] c/r: [x86_32] sys_restore to use ptregs prototype
[not found] ` <1260131469-2917-2-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-12-06 20:31 ` [PATCH] user-cr: eclone x86-64 wrapper Oren Laadan
@ 2009-12-06 22:51 ` Oren Laadan
[not found] ` <4B1C357C.2090003-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
1 sibling, 1 reply; 9+ messages in thread
From: Oren Laadan @ 2009-12-06 22:51 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Nathan Lynch

Oren Laadan wrote:
> Similar to other select syscalls (fork, clone, execve), sys_restart
> needs to access the pt_regs structure, so that it can modify it to
> restore the original state from the time of the checkpoint.
> 
> (This is less of an issue for x86-32, however is required for those
> architectures that otherwise save/restore partial state (e.g. not all
> registers) during syscall entry/exit, like x86-64.
> 
> This patch prepares to support c/r on x86-64, specifically:
> 
> * Changes the syscall prototype and definition to accept the pt_regs
>   struct as an argument (into %eax register).

I forgot to mention that this of course breaks s390 and ppc: you
need to provide an arch-dependent sys_restart() similar to how it's
done here.

Oren.

> 
> * Move arch/x86/mm/checkpoint*.c to arch/x86/kernel/...
> 
> * Split 32bit-dependent part of arch/x86/kernel/checkpoint.c into a
>   new arch/x86/kernel/checkpoint_32.c
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> ---

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <4B1C357C.2090003-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [PATCH 1/2] c/r: [x86_32] sys_restore to use ptregs prototype
[not found] ` <4B1C357C.2090003-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-12-07 20:55 ` Nathan Lynch
[not found] ` <1260219307.7151.3.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Nathan Lynch @ 2009-12-07 20:55 UTC (permalink / raw)
To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

> > * Changes the syscall prototype and definition to accept the pt_regs
> >   struct as an argument (into %eax register).
> 
> I forgot to mention that this of course breaks s390 and ppc: you
> need to provide an arch-dependent sys_restart() similar to how it's
> done here.

Thanks, here's the fixup for powerpc.

From 981dca4f3a879827d6e19a0cf32c7fd25b08a878 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
Date: Mon, 7 Dec 2009 14:51:13 -0600
Subject: [PATCH] checkpoint/powerpc: fix up restart code for ptregscall semantics

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/kernel/process.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 457c269..f9da9eb 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -30,6 +30,7 @@
 #include <linux/init_task.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/checkpoint.h>
 #include <linux/mqueue.h>
 #include <linux/hardirq.h>
 #include <linux/utsname.h>
@@ -990,6 +991,25 @@ out:
 	return error;
 }
 
+int sys_restart(unsigned long a0, unsigned long a1, unsigned long a2,
+		unsigned long a3, unsigned long a4, unsigned long a5,
+		struct pt_regs *regs)
+{
+	unsigned long flags;
+	pid_t pid;
+	int logfd;
+	int fd;
+
+	CHECK_FULL_REGS(regs);
+
+	pid = a0;
+	fd = a1;
+	flags = a2;
+	logfd = a3;
+
+	return do_sys_restart(pid, fd, flags, logfd);
+}
+
 #ifdef CONFIG_IRQSTACKS
 static inline int valid_irq_stack(unsigned long sp, struct task_struct *p,
 				  unsigned long nbytes)
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread
[parent not found: <1260219307.7151.3.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH 1/2] c/r: [x86_32] sys_restore to use ptregs prototype
[not found] ` <1260219307.7151.3.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-12-09 16:52 ` Serge E. Hallyn
2009-12-09 17:02 ` Serge E. Hallyn
1 sibling, 0 replies; 9+ messages in thread
From: Serge E. Hallyn @ 2009-12-09 16:52 UTC (permalink / raw)
To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> 
> > > * Changes the syscall prototype and definition to accept the pt_regs
> > >   struct as an argument (into %eax register).
> > 
> > I forgot to mention that this of course breaks s390 and ppc: you
> > need to provide an arch-dependent sys_restart() similar to how it's
> > done here.
> 
> Thanks, here's the fixup for powerpc.

Does this need to be in a #ifdef CONFIG_CHECKPOINT? Near as I can tell
there is no dummy do_sys_restart() for the CONFIG_CHECKPOINT=n case.

> From 981dca4f3a879827d6e19a0cf32c7fd25b08a878 Mon Sep 17 00:00:00 2001
> From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> Date: Mon, 7 Dec 2009 14:51:13 -0600
> Subject: [PATCH] checkpoint/powerpc: fix up restart code for ptregscall semantics
> 
> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> ---
>  arch/powerpc/kernel/process.c |   20 ++++++++++++++++++++
>  1 files changed, 20 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 457c269..f9da9eb 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -30,6 +30,7 @@
>  #include <linux/init_task.h>
>  #include <linux/module.h>
>  #include <linux/kallsyms.h>
> +#include <linux/checkpoint.h>
>  #include <linux/mqueue.h>
>  #include <linux/hardirq.h>
>  #include <linux/utsname.h>
> @@ -990,6 +991,25 @@ out:
>  	return error;
>  }
> 
> +int sys_restart(unsigned long a0, unsigned long a1, unsigned long a2,
> +		unsigned long a3, unsigned long a4, unsigned long a5,
> +		struct pt_regs *regs)
> +{
> +	unsigned long flags;
> +	pid_t pid;
> +	int logfd;
> +	int fd;
> +
> +	CHECK_FULL_REGS(regs);
> +
> +	pid = a0;
> +	fd = a1;
> +	flags = a2;
> +	logfd = a3;
> +
> +	return do_sys_restart(pid, fd, flags, logfd);
> +}
> +
>  #ifdef CONFIG_IRQSTACKS
>  static inline int valid_irq_stack(unsigned long sp, struct task_struct *p,
>  				  unsigned long nbytes)
> -- 
> 1.6.0.6
> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] c/r: [x86_32] sys_restore to use ptregs prototype
[not found] ` <1260219307.7151.3.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-12-09 16:52 ` Serge E. Hallyn
@ 2009-12-09 17:02 ` Serge E. Hallyn
1 sibling, 0 replies; 9+ messages in thread
From: Serge E. Hallyn @ 2009-12-09 17:02 UTC (permalink / raw)
To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Here I guess is the s390 version. If we need the pt_regs later, we
can get it using get_pt_regs(current) as the clone wrapper right above
it does.

Subject: [PATCH 1/1] define s390x sys_restart wrapper

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/kernel/process.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 0a59317..087f52c 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -241,6 +241,15 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 		       parent_tidptr, child_tidptr);
 }
 
+#ifdef CONFIG_CHECKPOINT
+extern long do_sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
+SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, unsigned long, flags,
+		int, logfd)
+{
+	return do_sys_restart(pid, fd, flags, logfd);
+}
+#endif
+
 SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *, uca,
 		int, args_size, pid_t __user *, pids)
 {
-- 
1.6.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
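On the CONFIG_CHECKPOINT=n question raised in this subthread: besides wrapping the arch syscall in #ifdef as the s390 patch does, another pattern sometimes used is a dummy inline in the shared header, so that arch wrappers build unchanged either way. The sketch below only illustrates that alternative in userspace C; it is not something the thread settled on, and SK_CONFIG_CHECKPOINT, sk_do_sys_restart and sk_sys_restart are hypothetical stand-ins for the kernel names.

```c
#include <errno.h>

/*
 * Header part (hypothetical): with the option enabled the real
 * implementation is declared; otherwise a dummy inline reports that
 * the syscall is not implemented.
 */
#ifdef SK_CONFIG_CHECKPOINT
long sk_do_sys_restart(int pid, int fd, unsigned long flags, int logfd);
#else
static inline long sk_do_sys_restart(int pid, int fd, unsigned long flags,
				     int logfd)
{
	(void)pid;
	(void)fd;
	(void)flags;
	(void)logfd;
	return -ENOSYS;
}
#endif

/*
 * Arch wrapper (hypothetical): needs no #ifdef of its own because the
 * header always provides some definition of sk_do_sys_restart().
 */
long sk_sys_restart(int pid, int fd, unsigned long flags, int logfd)
{
	return sk_do_sys_restart(pid, fd, flags, logfd);
}
```

Built without SK_CONFIG_CHECKPOINT defined, sk_sys_restart() returns -ENOSYS. The trade-off against the #ifdef approach is that the wrapper is still compiled and linked when the feature is disabled, and the syscall table entry has to point at it rather than at sys_ni_syscall.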