From: Dmitry Safonov <dsafonov@virtuozzo.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Shuah Khan <shuahkh@osg.samsung.com>,
"H. Peter Anvin" <hpa@zytor.com>, <0x7f454c46@gmail.com>,
<khorenko@virtuozzo.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Cyrill Gorcunov <gorcunov@openvz.org>,
Borislav Petkov <bp@alien8.de>, <xemul@virtuozzo.com>,
<linux-kselftest@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
X86 ML <x86@kernel.org>, Ingo Molnar <mingo@redhat.com>,
Dave Hansen <dave.hansen@linux.intel.com>
Subject: Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
Date: Thu, 7 Apr 2016 15:11:24 +0300 [thread overview]
Message-ID: <57064E6C.2030202@virtuozzo.com> (raw)
In-Reply-To: <CALCETrUhLVmHtEnt-RFMdPJ3E2T_798iWvkM8ciw=pydWCWjbw@mail.gmail.com>
On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
> [cc Dave Hansen for MPX]
>
> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@virtuozzo.com> wrote:
>> Now each process that runs natively on x86_64 may execute 32-bit code
>> by proper setting it's CS selector: either from LDT or reuse Linux's
>> USER32_CS. The vice-versa is also valid: running 64-bit code in
>> compatible task is also possible by choosing USER_CS.
>> So we may switch between 32 and 64 bit code execution in any process.
>> Linux will choose the right syscall numbers in entries for those
>> processes. But it still will consider them native/compat by the
>> personality, that elf loader set on launch. This affects i.e., ptrace
>> syscall on those tasks: PTRACE_GETREGSET will return 64/32-bit regset
>> according to process's mode (that's how strace detect task's
>> personality from 4.8 version).
>>
>> This patch adds arch_prctl calls for x86 that make possible to tell
>> Linux kernel in which mode the application is running currently.
>> Mainly, this is needed for CRIU: restoring compatible & native
>> applications both from 64-bit restorer. By that reason I wrapped all
>> the code in CONFIG_CHECKPOINT_RESTORE.
>> This patch solves also a problem for running 64-bit code in 32-bit elf
>> (and reverse), that you have only 32-bit elf vdso for fast syscalls.
>> When switching between native <-> compat mode by arch_prctl, it will
>> remap needed vdso binary blob for target mode.
> General comments first:
Thanks for your comments.
> You forgot about x32.
Will add x32 support for v2.
> I think that you should separate vdso remapping from "personality".
> vdso remapping should be available even on native 32-bit builds, which
> means that either you can't use arch_prctl for it or you'll have to
> wire up arch_prctl as a 32-bit syscall.
I cant say, I got your point. Do you mean by vdso remapping
mremap for vdso/vvar pages? I think, it should work now.
I did remapping for vdso as blob for native x86_64 task differs
to compatible task. So it's just changing blobs, address value
is there for convenience - I may omit it and just remap
different vdso blob at the same place where was previous vdso.
I'm not sure, why do we need possibility to map 64-bit vdso blob
on native 32-bit builds?
> For "personality", someone needs to enumerate all of the various thigs
> that try to track bitness and see how many of them even make sense.
> On brief inspection:
>
> - TIF_IA32: affects signal format and does something to ptrace. I
> suspect that whatever it does to ptrace is nonsensical, and I don't
> know whether we're stuck with it.
>
> - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
> isn't even done in a sensible way).
>
> - is_64bit_mm affects MPX and uprobes.
>
> On even more brief inspection:
>
> - uprobes using is_64bit_mm is buggy.
>
> - I doubt that having TASK_SIZE vary serves any purpose. Does anyone
> know why TASK_SIZE is different for different tasks? It would save
> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
> - Using TIF_IA32 for signal processing is IMO suboptimal. Instead,
> we should record which syscall installed the signal handler and use
> the corresponding frame format.
Oh, I like it, will do.
> - Using TIF_IA32 of the *target* for ptrace is nonsense. Having
> strace figure out syscall type using that is actively buggy, and I ran
> into that bug a few days ago and cursed at it. strace should inspect
> TS_COMPAT (I don't know how, but that's what should happen). We may
> be stuck with this for ABI reasons.
ptrace may check seg_32bit for code selector, what do you think?
> - For MPX, could we track which syscall called mpx_enable_management?
> I.e. can we stash in_compat_syscall's return from
> mpx_enable_management and use that instead of trying to determine the
> MPX data structure format by the mm's initial type?
>
> If we make all of these changes, we're left with *only* the ptrace
> thing, and you could certainly add a way for CRIU to switch that
> around.
That sounds really good!
>
>> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
>> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
>> Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
>> CC: Dmitry Safonov <0x7f454c46@gmail.com>
>> Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>> arch/x86/entry/vdso/vma.c | 76 ++++++++++++++++++++++++++--------
>> arch/x86/include/asm/vdso.h | 5 +++
>> arch/x86/include/uapi/asm/prctl.h | 6 +++
>> arch/x86/kernel/process_64.c | 87 +++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 157 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
>> index 10f704584922..9a1561da0bad 100644
>> --- a/arch/x86/entry/vdso/vma.c
>> +++ b/arch/x86/entry/vdso/vma.c
>> @@ -156,22 +156,21 @@ static int vvar_fault(const struct vm_special_mapping *sm,
>> return VM_FAULT_SIGBUS;
>> }
>>
>> -static int map_vdso(const struct vdso_image *image, bool calculate_addr)
>> +static int do_map_vdso(const struct vdso_image *image, bool calculate_addr,
>> + unsigned long addr)
>> {
>> struct mm_struct *mm = current->mm;
>> struct vm_area_struct *vma;
>> - unsigned long addr, text_start;
>> + unsigned long text_start;
>> int ret = 0;
>> static const struct vm_special_mapping vvar_mapping = {
>> .name = "[vvar]",
>> .fault = vvar_fault,
>> };
>>
>> - if (calculate_addr) {
>> + if (calculate_addr && !addr) {
>> addr = vdso_addr(current->mm->start_stack,
>> image->size - image->sym_vvar_start);
>> - } else {
>> - addr = 0;
>> }
>>
> This is overcomplicated. Just pass in the address and us it as supplied.
Will do.
>
>> down_write(&mm->mmap_sem);
>> @@ -209,11 +208,11 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
>> VM_PFNMAP,
>> &vvar_mapping);
>>
>> - if (IS_ERR(vma)) {
>> + if (IS_ERR(vma))
>> ret = PTR_ERR(vma);
>> - goto up_fail;
>> - }
>>
>> + if (ret)
>> + do_munmap(mm, addr, image->size - image->sym_vvar_start);
>> up_fail:
>> if (ret)
>> current->mm->context.vdso = NULL;
>> @@ -223,24 +222,28 @@ up_fail:
>> }
>>
>> #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
>> -static int load_vdso32(void)
>> +static int load_vdso32(unsigned long addr)
>> {
>> if (vdso32_enabled != 1) /* Other values all mean "disabled" */
>> return 0;
>>
>> - return map_vdso(&vdso_image_32, false);
>> + return do_map_vdso(&vdso_image_32, false, addr);
>> }
>> #endif
> I'd just make it one function do_map_vdso(type, addr).
Sure, will do
>
>> #ifdef CONFIG_X86_64
>> -int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>> +static int load_vdso64(unsigned long addr)
>> {
>> if (!vdso64_enabled)
>> return 0;
>>
>> - return map_vdso(&vdso_image_64, true);
>> + return do_map_vdso(&vdso_image_64, true, addr);
>> }
>>
>> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>> +{
>> + return load_vdso64(0);
>> +}
>> #ifdef CONFIG_COMPAT
>> int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>> int uses_interp)
>> @@ -250,20 +253,59 @@ int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
>> if (!vdso64_enabled)
>> return 0;
>>
>> - return map_vdso(&vdso_image_x32, true);
>> + return do_map_vdso(&vdso_image_x32, true, 0);
>> }
>> #endif
>> #ifdef CONFIG_IA32_EMULATION
>> - return load_vdso32();
>> + return load_vdso32(0);
> No special 0 please.
Yes, will use here vdso_addr(..) inplace.
>
>> #else
>> return 0;
>> #endif
>> }
>> -#endif
>> -#else
>> +#endif /* CONFIG_COMPAT */
>> +
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +unsigned long unmap_vdso(void)
>> +{
>> + struct vm_area_struct *vma;
>> + unsigned long addr = (unsigned long)current->mm->context.vdso;
>> +
>> + if (!addr)
>> + return 0;
>> +
>> + /* vvar pages */
>> + vma = find_vma(current->mm, addr - 1);
>> + if (vma)
>> + vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
>> +
>> + /* vdso pages */
>> + vma = find_vma(current->mm, addr);
>> + if (vma)
>> + vm_munmap(vma->vm_start, vma->vm_end - vma->vm_start);
>> +
>> + current->mm->context.vdso = NULL;
>> +
>> + return addr;
>> +}
>> +/*
>> + * Maps needed vdso type: vdso_image_32/vdso_image_64
>> + * @compatible - true for compatible, false for native vdso image
>> + * @addr - specify addr for vdso mapping (0 for random/searching)
>> + * NOTE: be sure to set/clear thread-specific flags before
>> + * calling this function.
>> + */
>> +int map_vdso(bool compatible, unsigned long addr)
>> +{
>> + if (compatible)
>> + return load_vdso32(addr);
>> + else
>> + return load_vdso64(addr);
>> +}
> This makes sense. But it can't be bool -- you forgot x32.
Sure, will rework for x32 support.
>
>> +#endif /* CONFIG_IA32_EMULATION && CONFIG_CHECKPOINT_RESTORE */
>> +#else /* !CONFIG_X86_64 */
>> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>> {
>> - return load_vdso32();
>> + return load_vdso32(0);
>> }
>> #endif
>>
>> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
>> index 43dc55be524e..3ead7cc48a68 100644
>> --- a/arch/x86/include/asm/vdso.h
>> +++ b/arch/x86/include/asm/vdso.h
>> @@ -39,6 +39,11 @@ extern const struct vdso_image vdso_image_x32;
>> extern const struct vdso_image vdso_image_32;
>> #endif
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +extern int map_vdso(bool to_compat, unsigned long addr);
>> +extern unsigned long unmap_vdso(void);
>> +#endif
>> +
>> extern void __init init_vdso_image(const struct vdso_image *image);
>>
>> #endif /* __ASSEMBLER__ */
>> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
>> index 3ac5032fae09..455844f06485 100644
>> --- a/arch/x86/include/uapi/asm/prctl.h
>> +++ b/arch/x86/include/uapi/asm/prctl.h
>> @@ -6,4 +6,10 @@
>> #define ARCH_GET_FS 0x1003
>> #define ARCH_GET_GS 0x1004
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +#define ARCH_SET_COMPAT 0x2001
>> +#define ARCH_SET_NATIVE 0x2002
>> +#define ARCH_GET_PERSONALITY 0x2003
>> +#endif
>> +
>> #endif /* _ASM_X86_PRCTL_H */
>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>> index 6cbab31ac23a..e50660d59530 100644
>> --- a/arch/x86/kernel/process_64.c
>> +++ b/arch/x86/kernel/process_64.c
>> @@ -49,6 +49,7 @@
>> #include <asm/debugreg.h>
>> #include <asm/switch_to.h>
>> #include <asm/xen/hypervisor.h>
>> +#include <asm/vdso.h>
>>
>> asmlinkage extern void ret_from_fork(void);
>>
>> @@ -505,6 +506,83 @@ void set_personality_ia32(bool x32)
>> }
>> EXPORT_SYMBOL_GPL(set_personality_ia32);
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> +/*
>> + * Check if there are still some vmas (except vdso) for current,
>> + * which placed above compatible TASK_SIZE.
>> + * Check also code, data, stack, args and env placements.
>> + * Returns true if all mappings are compatible.
>> + */
>> +static bool task_mappings_compatible(void)
>> +{
>> + struct mm_struct *mm = current->mm;
>> + unsigned long top_addr = IA32_PAGE_OFFSET;
>> + struct vm_area_struct *vma = find_vma(mm, top_addr);
>> +
>> + if (mm->end_code > top_addr ||
>> + mm->end_data > top_addr ||
>> + mm->start_stack > top_addr ||
>> + mm->brk > top_addr ||
>> + mm->arg_end > top_addr ||
>> + mm->env_end > top_addr)
>> + return false;
>> +
>> + while (vma) {
>> + if ((vma->vm_start != (unsigned long)mm->context.vdso) &&
>> + (vma->vm_end != (unsigned long)mm->context.vdso))
>> + return false;
>> +
>> + top_addr = vma->vm_end;
>> + vma = find_vma(mm, top_addr);
>> + }
>> +
>> + return true;
>> +}
> What goes wrong if there are leftover high mappings?
Nothing should. That's not expected by me
that someone will "hide" mappings over 32-bit
address space - that's the reason why I did it.
I'll drop this part.
>
>> +
>> +static int do_set_personality(bool compat, unsigned long addr)
>> +{
>> + int ret;
>> + unsigned long old_vdso_base;
>> + unsigned long old_mmap_base = current->mm->mmap_base;
>> +
>> + if (test_thread_flag(TIF_IA32) == compat) /* nothing to do */
>> + return 0;
> Please don't. Instead, remove TIF_IA32 entirely.
Thanks, I will remove TIF_IA32.
> Also, please separate out ARCH_REMAP_VDSO from any personality change API.
Ok.
>
>> +
>> + if (compat && !task_mappings_compatible())
>> + return -EFAULT;
>> +
>> + /*
>> + * We can't just remap vdso to needed location:
>> + * vdso compatible and native images differs
>> + */
>> + old_vdso_base = unmap_vdso();
>> +
>> + if (compat)
>> + set_personality_ia32(false);
>> + else
>> + set_personality_64bit();
>> +
>> + /*
>> + * Update mmap_base & get_unmapped_area helper, side effect:
>> + * one may change get_unmapped_area or mmap_base with personality()
>> + * or switching to and fro compatible mode
>> + */
>> + arch_pick_mmap_layout(current->mm);
>> +
>> + ret = map_vdso(compat, addr);
>> + if (ret) {
>> + current->mm->mmap_base = old_mmap_base;
>> + if (compat)
>> + set_personality_64bit();
>> + else
>> + set_personality_ia32(false);
>> + WARN_ON(map_vdso(!compat, old_vdso_base));
>> + }
>> +
>> + return ret;
>> +}
>> +#endif
>> +
>> long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
>> {
>> int ret = 0;
>> @@ -592,6 +670,15 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
>> break;
>> }
>>
>> +#if defined(CONFIG_IA32_EMULATION) && defined(CONFIG_CHECKPOINT_RESTORE)
>> + case ARCH_SET_COMPAT:
>> + return do_set_personality(true, addr);
>> + case ARCH_SET_NATIVE:
>> + return do_set_personality(false, addr);
>> + case ARCH_GET_PERSONALITY:
>> + return test_thread_flag(TIF_IA32);
>> +#endif
>> +
>> default:
>> ret = -EINVAL;
>> break;
>> --
>> 2.7.4
>>
--
Regards,
Dmitry Safonov
next prev parent reply other threads:[~2016-04-07 12:12 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-06 16:29 [PATCH 0/2] x86: add arch_prctl to switch between native/compat modes Dmitry Safonov
2016-04-06 16:29 ` [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode Dmitry Safonov
2016-04-06 18:04 ` Andy Lutomirski
2016-04-06 18:49 ` Andy Lutomirski
2016-04-07 12:11 ` Dmitry Safonov [this message]
2016-04-07 12:21 ` Cyrill Gorcunov
2016-04-07 12:35 ` Dmitry Safonov
2016-04-07 14:39 ` Andy Lutomirski
2016-04-07 15:18 ` Dmitry Safonov
2016-04-08 13:50 ` Dmitry Safonov
2016-04-08 15:56 ` Andy Lutomirski
2016-04-08 16:18 ` Dmitry Safonov
2016-04-08 20:44 ` Andy Lutomirski
2016-04-09 8:06 ` Dmitry Safonov
2016-04-13 16:55 ` Dmitry Safonov
2016-04-14 18:27 ` Andy Lutomirski
2016-04-20 11:04 ` Peter Zijlstra
2016-04-20 15:40 ` Andy Lutomirski
2016-04-20 19:05 ` Peter Zijlstra
2016-04-21 19:39 ` Andy Lutomirski
2016-04-21 20:12 ` Peter Zijlstra
2016-04-21 23:27 ` Andy Lutomirski
2016-04-21 23:46 ` Andy Lutomirski
2016-04-25 15:16 ` Peter Zijlstra
2016-04-25 16:50 ` Andy Lutomirski
2016-04-06 16:29 ` [PATCH 2/2] x86/tools/testing: add test for ARCH_SET_COMPAT Dmitry Safonov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57064E6C.2030202@virtuozzo.com \
--to=dsafonov@virtuozzo.com \
--cc=0x7f454c46@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=gorcunov@openvz.org \
--cc=hpa@zytor.com \
--cc=khorenko@virtuozzo.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=mingo@redhat.com \
--cc=shuahkh@osg.samsung.com \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
--cc=xemul@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox