* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas [not found] <20200530055953.817666-1-krisman@collabora.com> @ 2020-05-30 17:30 ` Kees Cook 2020-05-31 5:56 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 9+ messages in thread From: Kees Cook @ 2020-05-30 17:30 UTC (permalink / raw) To: Gabriel Krisman Bertazi Cc: linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, Paul Gofman, linux-security-module On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote: > Modern Windows applications are executing system call instructions > directly from the application's code without going through the WinAPI. > This breaks Wine emulation, because it doesn't have a chance to > intercept and emulate these syscalls before they are submitted to Linux. > > In addition, we cannot simply trap every system call of the application > to userspace using PTRACE_SYSEMU, because performance would suffer, > since our main use case is to run Windows games over Linux. Therefore, > we need some in-kernel filtering to decide whether the syscall was > issued by the wine code or by the windows application. Interesting use-case! It seems like you're in the position of needing to invert the assumption about syscalls: before you knew everything going through WinAPI needed emulation, and now you need to assume everything not going through a native library needs emulation. Oof. Is it possible to disassemble and instrument the Windows code to insert breakpoints (or emulation calls) at all the Windows syscall points? > [...] > * Why not SECCOMP_MODE_FILTER? > > We experimented with dynamically generating BPF filters for whitelisted > memory regions and using SECCOMP_MODE_FILTER, but there are a few > reasons why it isn't enough nor a good idea for our use case: > > 1. We cannot set the filters at program initialization time and forget > about it, since there is no way of knowing which modules will be loaded, > whether native and windows. Filter would need a way to be updated > frequently during game execution. > > 2. We cannot predict which Linux libraries will issue syscalls directly. > Most of the time, whitelisting libc and a few other libraries is enough, > but there are no guarantees other Linux libraries won't issue syscalls > directly and break the execution. Adding every linux library that is > loaded also has a large performance cost due to the large resulting > filter. Just so I can understand the expected use: given the dynamic nature of the library loading, how would Wine be marking the VMAs? > 3. As I mentioned before, performance is critical. In our testing with > just a single memory segment blacklisted/whitelisted, the minimum size > of a bpf filter would be 4 instructions. In that scenario, > SECCOMP_MODE_FILTER added an average overhead of 10% to the execution > time of sysinfo(2) in comparison to seccomp disabled, while the impact > of SECCOMP_MODE_MEMMAP was averaged around 1.5%. Was the BPF JIT enabled? I was recently examining filter performance too: https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/ > Indeed, points 1 and 2 could be worked around with some userspace work > and improved SECCOMP_MODE_FILTER support, but at a high performance and > some stability cost, to obtain the semantics we want. Still, the > performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive > enough that I believe it should be considered as an upstream solution. It looks like you're using SECCOMP_RET_TRAP for this? Signal handling can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF? > Sending as an RFC for now to get the discussion started. In particular: > > 1) Would this design be acceptable? Maybe? I want to spend a little time exploring alternatives first. :) > 2) I'm not comfortable with using the high part of vm_flags, though I'm > not sure where else it would fit. Suggestions? I don't know this area very well. Hopefully some of the mm folks can comment on it. It seems like there needs to be a more dynamic way mark VMAs for arbitrary purposes (since this isn't _actually_ a PROT* flag, it's just a marking for seccomp to check). Regardless, here's a review of the specifics of patch, without regard to the design question above. :) > > CC: Thomas Gleixner <tglx@linutronix.de> > CC: Kees Cook <keescook@chromium.org> > CC: Andy Lutomirski <luto@amacapital.net> > CC: Will Drewry <wad@chromium.org> > CC: H. Peter Anvin <hpa@zytor.com> > CC: Paul Gofman <gofmanp@gmail.com> > Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> > --- > arch/Kconfig | 21 +++++++++ > arch/x86/Kconfig | 1 + > include/linux/mm.h | 8 ++++ > include/linux/mman.h | 4 +- > include/uapi/asm-generic/mman-common.h | 2 + > include/uapi/linux/seccomp.h | 2 + > kernel/seccomp.c | 59 ++++++++++++++++++++++++++ > mm/mprotect.c | 2 +- > 8 files changed, 97 insertions(+), 2 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 786a85d4ad40..e5c10231c9c2 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -471,6 +471,27 @@ config SECCOMP_FILTER > > See Documentation/userspace-api/seccomp_filter.rst for details. > > +config HAVE_ARCH_SECCOMP_MEMMAP > + bool > + help > + An arch should select this symbol if it provides all of these things: > + - syscall_get_arch() > + - syscall_get_arguments() > + - syscall_rollback() > + - syscall_set_return_value() > + - SIGSYS siginfo_t support > + - secure_computing is called from a ptrace_event()-safe context > + - secure_computing return value is checked and a return value of -1 > + results in the system call being skipped immediately. > + - seccomp syscall wired up > + > +config SECCOMP_MEMMAP > + def_bool y > + depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP > + help > + Enable tasks to trap syscalls based on the page that triggered > + the syscall. > + Since I don't see anything distinct in the requirements for SECCOMP_MEMMAP, so I don't think it needs a separate CONFIG: it is almost entirely built on the SECCOMP_FILTER infrastructure. I'd just drop the CONFIGs. > config HAVE_ARCH_STACKLEAK > bool > help > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 2d3f963fd6f1..70998b01c533 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -145,6 +145,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_MEMMAP (and it's not arch-specific) > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 5a323422d783..6271fb7fd5bc 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp); > #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ > #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ > #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ > +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ > #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) > #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) > #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) > #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) > #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) > +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) > #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ > > #ifdef CONFIG_ARCH_HAS_PKEYS > @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp); > # define VM_GROWSUP VM_NONE > #endif > > +#if defined(CONFIG_X86) > +# define VM_NOSYSCALL VM_HIGH_ARCH_5 > +#else > +# define VM_NOSYSCALL VM_NONE > +#endif > + > /* Bits set in the VMA until the stack is in its final location */ > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ) > I have no idea what the correct way to mark VMAs would be. As you mentioned, this seems ... wrong. :) > diff --git a/include/linux/mman.h b/include/linux/mman.h > index 4b08e9c9c538..a5ca42eb685a 100644 > --- a/include/linux/mman.h > +++ b/include/linux/mman.h > @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages) > */ > static inline bool arch_validate_prot(unsigned long prot, unsigned long addr) > { > - return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0; > + return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM > + | PROT_NOSYSCALL)) == 0; > } > #define arch_validate_prot arch_validate_prot > #endif > @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey) > return _calc_vm_trans(prot, PROT_READ, VM_READ ) | > _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) | > _calc_vm_trans(prot, PROT_EXEC, VM_EXEC) | > + _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) | > arch_calc_vm_prot_bits(prot, pkey); > } > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index f94f65d429be..4eafbaa3f4bc 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -13,6 +13,8 @@ > #define PROT_SEM 0x8 /* page may be used for atomic ops */ > /* 0x10 reserved for arch-specific use */ > /* 0x20 reserved for arch-specific use */ > +#define PROT_NOSYSCALL 0x40 /* page cannot trigger syscalls */ > + > #define PROT_NONE 0x0 /* page can not be accessed */ > #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ > #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ AIUI, all of the above is to plumb the VMA marking through an mprotect() call. I wonder if perhaps madvise() would be better? I'm not sure how tight we are on new flags there, but I think it would be cleaner to use that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK. > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index c1735455bc53..c7d042a409e7 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -10,12 +10,14 @@ > #define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ > #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ > #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ > +#define SECCOMP_MODE_MEMMAP 3 /* Lock syscalls per memory region. */ Making this incompatible with FILTER might cause problems for the future (more and more process launchers are starting to set filters). So this would, perhaps, be a new flag for struct seccomp, rather than a new operating mode. > > /* Valid operations for seccomp syscall. */ > #define SECCOMP_SET_MODE_STRICT 0 > #define SECCOMP_SET_MODE_FILTER 1 > #define SECCOMP_GET_ACTION_AVAIL 2 > #define SECCOMP_GET_NOTIF_SIZES 3 > +#define SECCOMP_SET_MODE_MEMMAP 4 > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 55a6184f5990..ebf09b02db8d 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, > } > #endif > > +#ifdef CONFIG_SECCOMP_MEMMAP > +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd) > +{ > + struct vm_area_struct *vma = find_vma(current->mm, > + sd->instruction_pointer); > + if (!vma) > + BUG(); No new kernel code should use BUG: https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on I would maybe pr_warn_once(), but then treat it as if it was marked with VM_NOSYSCALL. > + > + if (!(vma->vm_flags & VM_NOSYSCALL)) > + return 0; > + > + syscall_rollback(current, task_pt_regs(current)); > + seccomp_send_sigsys(this_syscall, 0); > + > + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true); > + > + return -1; > +} This really just looks like an ip_address filter, but I get what you mean about stacking filters, etc. This may finally be the day we turn to eBPF in seccomp, since that would give you access to using a map lookup on ip_address, and the map could be updated externally (removing the need for the madvise() changes). > + > +static long seccomp_set_mode_memmap(unsigned int flags) > +{ > + const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP; > + long ret = 0; > + > + if (flags & SECCOMP_FILTER_FLAG_TSYNC) > + return -EINVAL; This should just test for any bits in flags. TSYNC + memmap should be avoided, IMO. > + > + spin_lock_irq(¤t->sighand->siglock); > + > + if (seccomp_may_assign_mode(seccomp_mode)) > + seccomp_assign_mode(current, seccomp_mode, flags); > + else > + ret = -EINVAL; > + > + spin_unlock_irq(¤t->sighand->siglock); > + > + return ret; > +} > +#else > +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd) > +{ > + BUG(); > +} > +static long seccomp_set_mode_memmap(unsigned int flags) > +{ > + return -EINVAL; > +} > +#endif /* CONFIG_SECCOMP_MEMMAP */ > + > int __secure_computing(const struct seccomp_data *sd) > { > int mode = current->seccomp.mode; > @@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd) > return 0; > case SECCOMP_MODE_FILTER: > return __seccomp_filter(this_syscall, sd, false); > + case SECCOMP_MODE_MEMMAP: > + return __seccomp_memmap(this_syscall, sd); > default: > BUG(); > } > @@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags, > return -EINVAL; > > return seccomp_get_notif_sizes(uargs); > + case SECCOMP_SET_MODE_MEMMAP: > + if (uargs != NULL) > + return -EINVAL; > + return seccomp_set_mode_memmap(flags); > default: > return -EINVAL; > } > @@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter) > op = SECCOMP_SET_MODE_FILTER; > uargs = filter; > break; > + case SECCOMP_MODE_MEMMAP: > + op = SECCOMP_SET_MODE_MEMMAP; > + uargs = NULL; > + break; The prctl() interface is deprecated, so no new features should be added there. > default: > return -EINVAL; > } > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 494192ca954b..6b5c00e8bbdc 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, > * cleared from the VMA. > */ > mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC | > - VM_FLAGS_CLEAR; > + VM_NOSYSCALL | VM_FLAGS_CLEAR; > > new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey); > newflags = calc_vm_prot_bits(prot, new_vma_pkey); > -- > 2.27.0.rc2 > So, yes, interesting. I'll continue to think about how this could work best. Can you share what your filters looked like when you were originally trying those? Thanks! -Kees -- Kees Cook ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-30 17:30 ` [PATCH RFC] seccomp: Implement syscall isolation based on memory areas Kees Cook @ 2020-05-31 5:56 ` Gabriel Krisman Bertazi 2020-05-31 12:39 ` Paul Gofman 0 siblings, 1 reply; 9+ messages in thread From: Gabriel Krisman Bertazi @ 2020-05-31 5:56 UTC (permalink / raw) To: Kees Cook Cc: linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, Paul Gofman, linux-security-module Kees Cook <keescook@chromium.org> writes: > On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote: >> Modern Windows applications are executing system call instructions >> directly from the application's code without going through the WinAPI. >> This breaks Wine emulation, because it doesn't have a chance to >> intercept and emulate these syscalls before they are submitted to Linux. >> >> In addition, we cannot simply trap every system call of the application >> to userspace using PTRACE_SYSEMU, because performance would suffer, >> since our main use case is to run Windows games over Linux. Therefore, >> we need some in-kernel filtering to decide whether the syscall was >> issued by the wine code or by the windows application. > > Interesting use-case! It seems like you're in the position of needing to > invert the assumption about syscalls: before you knew everything going > through WinAPI needed emulation, and now you need to assume everything > not going through a native library needs emulation. Oof. > > Is it possible to disassemble and instrument the Windows code to insert > breakpoints (or emulation calls) at all the Windows syscall points? Hi Kees, I considered instrumenting the syscall instructions with calls to some wrapper, but I was told that modifying the game in memory or on disk will trigger all sorts of anti-cheating mechanisms (my main use case are windows games). >> [...] >> * Why not SECCOMP_MODE_FILTER? >> >> We experimented with dynamically generating BPF filters for whitelisted >> memory regions and using SECCOMP_MODE_FILTER, but there are a few >> reasons why it isn't enough nor a good idea for our use case: >> >> 1. We cannot set the filters at program initialization time and forget >> about it, since there is no way of knowing which modules will be loaded, >> whether native and windows. Filter would need a way to be updated >> frequently during game execution. >> >> 2. We cannot predict which Linux libraries will issue syscalls directly. >> Most of the time, whitelisting libc and a few other libraries is enough, >> but there are no guarantees other Linux libraries won't issue syscalls >> directly and break the execution. Adding every linux library that is >> loaded also has a large performance cost due to the large resulting >> filter. > > Just so I can understand the expected use: given the dynamic nature of > the library loading, how would Wine be marking the VMAs? Paul (cc'ed) is the wine expert, but my understanding is that memory allocation and initial program load of the emulated binary will go through wine. It does the allocation and mark the vma accordingly before returning the allocated range to the windows application. >> 3. As I mentioned before, performance is critical. In our testing with >> just a single memory segment blacklisted/whitelisted, the minimum size >> of a bpf filter would be 4 instructions. In that scenario, >> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution >> time of sysinfo(2) in comparison to seccomp disabled, while the impact >> of SECCOMP_MODE_MEMMAP was averaged around 1.5%. > > Was the BPF JIT enabled? I was recently examining filter performance too: > https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/ yes: root@dilma:~# sysctl -a | grep -i jit net.core.bpf_jit_enable = 1 net.core.bpf_jit_harden = 0 > >> Indeed, points 1 and 2 could be worked around with some userspace work >> and improved SECCOMP_MODE_FILTER support, but at a high performance and >> some stability cost, to obtain the semantics we want. Still, the >> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive >> enough that I believe it should be considered as an upstream solution. > > It looks like you're using SECCOMP_RET_TRAP for this? Signal handling > can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF? I experimented with SECCOMP_RET_TRAP and SECCOMP_RET_TRACE, but I hadn't consider USER_NOTIF. It seems to be a quite recent feature and I wasn't aware of it. I will try it and let you know. >> diff --git a/include/linux/mman.h b/include/linux/mman.h >> index 4b08e9c9c538..a5ca42eb685a 100644 >> --- a/include/linux/mman.h >> +++ b/include/linux/mman.h >> @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages) >> */ >> static inline bool arch_validate_prot(unsigned long prot, unsigned long addr) >> { >> - return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0; >> + return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM >> + | PROT_NOSYSCALL)) == 0; >> } >> #define arch_validate_prot arch_validate_prot >> #endif >> @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey) >> return _calc_vm_trans(prot, PROT_READ, VM_READ ) | >> _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) | >> _calc_vm_trans(prot, PROT_EXEC, VM_EXEC) | >> + _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) | >> arch_calc_vm_prot_bits(prot, pkey); >> } >> >> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h >> index f94f65d429be..4eafbaa3f4bc 100644 >> --- a/include/uapi/asm-generic/mman-common.h >> +++ b/include/uapi/asm-generic/mman-common.h >> @@ -13,6 +13,8 @@ >> #define PROT_SEM 0x8 /* page may be used for atomic ops */ >> /* 0x10 reserved for arch-specific use */ >> /* 0x20 reserved for arch-specific use */ >> +#define PROT_NOSYSCALL 0x40 /* page cannot trigger syscalls */ >> + >> #define PROT_NONE 0x0 /* page can not be accessed */ >> #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ >> #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ > > AIUI, all of the above is to plumb the VMA marking through an mprotect() > call. I wonder if perhaps madvise() would be better? I'm not sure how > tight we are on new flags there, but I think it would be cleaner to use > that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK. Right, will do. I considered it, but wasn't sure if madvise semantics gave any guarantees of the result, but is seems to be the case for at least for WIPEONFORK. it is important to have it configurable on mmap too, to avoid the extra syscall. Plumbing through the mmap flags argument shouldn't be a problem. >> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h >> index c1735455bc53..c7d042a409e7 100644 >> --- a/include/uapi/linux/seccomp.h >> +++ b/include/uapi/linux/seccomp.h >> @@ -10,12 +10,14 @@ >> #define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ >> #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ >> #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ >> +#define SECCOMP_MODE_MEMMAP 3 /* Lock syscalls per memory region. */ > > Making this incompatible with FILTER might cause problems for the future > (more and more process launchers are starting to set filters). > > So this would, perhaps, be a new flag for struct seccomp, rather than a > new operating mode. Yes, agreed. I guess it would be a real scenario, even for us, to have isolation against malicious syscalls (MODE_FILTER) combined with region protection (MEMAP) for emulation. > >> >> /* Valid operations for seccomp syscall. */ >> #define SECCOMP_SET_MODE_STRICT 0 >> #define SECCOMP_SET_MODE_FILTER 1 >> #define SECCOMP_GET_ACTION_AVAIL 2 >> #define SECCOMP_GET_NOTIF_SIZES 3 >> +#define SECCOMP_SET_MODE_MEMMAP 4 >> >> /* Valid flags for SECCOMP_SET_MODE_FILTER */ >> #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index 55a6184f5990..ebf09b02db8d 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, >> } >> #endif >> >> +#ifdef CONFIG_SECCOMP_MEMMAP >> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd) >> +{ >> + struct vm_area_struct *vma = find_vma(current->mm, >> + sd->instruction_pointer); >> + if (!vma) >> + BUG(); > > No new kernel code should use BUG: > https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on > > I would maybe pr_warn_once(), but then treat it as if it was marked with > VM_NOSYSCALL. > >> + >> + if (!(vma->vm_flags & VM_NOSYSCALL)) >> + return 0; >> + >> + syscall_rollback(current, task_pt_regs(current)); >> + seccomp_send_sigsys(this_syscall, 0); >> + >> + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true); >> + >> + return -1; >> +} > > This really just looks like an ip_address filter, but I get what you > mean about stacking filters, etc. This may finally be the day we turn to > eBPF in seccomp, since that would give you access to using a map lookup > on ip_address, and the map could be updated externally (removing the > need for the madvise() changes). And 64-bit comparisons :) that would be a good solution, we'd still need to look at some details, like disabling/updating filters at runtime when some new library is loaded, but since we can update externally, I think it covers it. From the wine side, it should be enough to get the semantics we want, but I don't know about performance. >> + >> +static long seccomp_set_mode_memmap(unsigned int flags) >> +{ >> + const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP; >> + long ret = 0; >> + >> + if (flags & SECCOMP_FILTER_FLAG_TSYNC) >> + return -EINVAL; > > This should just test for any bits in flags. TSYNC + memmap should be > avoided, IMO. Maybe it is worth to expose SECCOMP_FILTER_FLAG_LOG? >> default: >> return -EINVAL; >> } >> diff --git a/mm/mprotect.c b/mm/mprotect.c >> index 494192ca954b..6b5c00e8bbdc 100644 >> --- a/mm/mprotect.c >> +++ b/mm/mprotect.c >> @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, >> * cleared from the VMA. >> */ >> mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC | >> - VM_FLAGS_CLEAR; >> + VM_NOSYSCALL | VM_FLAGS_CLEAR; >> >> new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey); >> newflags = calc_vm_prot_bits(prot, new_vma_pkey); >> -- >> 2.27.0.rc2 >> > > So, yes, interesting. I'll continue to think about how this could work > best. Can you share what your filters looked like when you were > originally trying those? The results I shared in the previous email had the following filter: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))), BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0xf000, 0, 1), BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP), BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW), Which is much much simpler than the real thing. I also had a small filter generator that generated segments like the following ( ip_HW and ip_LW are the high and low part of the instruction pointer): 0007: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_HW) 0008: BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EH, 0, REL_ADDR(next)) 0009: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_LW) 000a: BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, EL, REL_ADDR(next), 0) 000b: BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, SL, REL_ADDR(filter->jump_allow_target), REL_ADDR(next)) 000c: next: Thanks, -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 5:56 ` Gabriel Krisman Bertazi @ 2020-05-31 12:39 ` Paul Gofman 2020-05-31 16:49 ` Matthew Wilcox 0 siblings, 1 reply; 9+ messages in thread From: Paul Gofman @ 2020-05-31 12:39 UTC (permalink / raw) To: Gabriel Krisman Bertazi, Kees Cook Cc: linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura Hi! thanks for looking into this. On 5/31/20 08:56, Gabriel Krisman Bertazi wrote: > >> Is it possible to disassemble and instrument the Windows code to insert >> breakpoints (or emulation calls) at all the Windows syscall points? > Hi Kees, > > I considered instrumenting the syscall instructions with calls to some > wrapper, but I was told that modifying the game in memory or on disk > will trigger all sorts of anti-cheating mechanisms (my main use case are > windows games). Yes, this is the case. Besides, before instrumenting, we would need some way to find those syscalls in the highly obfuscated dynamically generated code, the whole purpose of which is to prevent disassembling, debugging and finding things like that in it. Ultimately, even for the cases when it would be technically feasible I don't think Wine could ever go for modifying the program's code (unless of course this is the part of what the program is doing itself using winapi calls). Wine is trying to implement the API as close to Windows as possible so the DRM can work with Wine, modifying the program's code is something different. > >>> [...] >>> * Why not SECCOMP_MODE_FILTER? >>> >>> We experimented with dynamically generating BPF filters for whitelisted >>> memory regions and using SECCOMP_MODE_FILTER, but there are a few >>> reasons why it isn't enough nor a good idea for our use case: >>> >>> 1. We cannot set the filters at program initialization time and forget >>> about it, since there is no way of knowing which modules will be loaded, >>> whether native and windows. Filter would need a way to be updated >>> frequently during game execution. >>> >>> 2. We cannot predict which Linux libraries will issue syscalls directly. >>> Most of the time, whitelisting libc and a few other libraries is enough, >>> but there are no guarantees other Linux libraries won't issue syscalls >>> directly and break the execution. Adding every linux library that is >>> loaded also has a large performance cost due to the large resulting >>> filter. >> Just so I can understand the expected use: given the dynamic nature of >> the library loading, how would Wine be marking the VMAs? > Paul (cc'ed) is the wine expert, but my understanding is that memory > allocation and initial program load of the emulated binary will go > through wine. It does the allocation and mark the vma accordingly > before returning the allocated range to the windows application. Yes, exactly. Pretty much any memory allocation which Wine does needs syscalls (if those are ever encountered later during executing code from those areas) to be trapped by Wine and passed to Wine's implementation of the corresponding Windows API function. Linux native libraries loading and memory allocations performed by them go outside of Wine control. > >>> Indeed, points 1 and 2 could be worked around with some userspace work >>> and improved SECCOMP_MODE_FILTER support, but at a high performance and >>> some stability cost, to obtain the semantics we want. Still, the >>> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive >>> enough that I believe it should be considered as an upstream solution. >> It looks like you're using SECCOMP_RET_TRAP for this? Signal handling >> can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF? We are not much concerned with the overhead of the trapped syscall in our use case, those are very rare. What we are concerned with is the performance impact on the normal Linux syscalls when the syscall trapping is enabled. When I was measuring that impact I've got the same 10% overhead for the untrapped syscalls (that is, always hitting SECCOMP_RET_ALLOW case) with the filters Gabriel mentioned. >> >>> + >>> + if (!(vma->vm_flags & VM_NOSYSCALL)) >>> + return 0; >>> + >>> + syscall_rollback(current, task_pt_regs(current)); >>> + seccomp_send_sigsys(this_syscall, 0); >>> + >>> + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true); >>> + >>> + return -1; >>> +} >> This really just looks like an ip_address filter, but I get what you >> mean about stacking filters, etc. This may finally be the day we turn to >> eBPF in seccomp, since that would give you access to using a map lookup >> on ip_address, and the map could be updated externally (removing the >> need for the madvise() changes). > And 64-bit comparisons :) > > that would be a good solution, we'd still need to look at some details, > like disabling/updating filters at runtime when some new library is > loaded, but since we can update externally, I think it covers it. I am afraid that for a general case the filter add / update / removal should be done not just when a new library is loaded or unloaded but pretty much any time new (executable) memory region is allocated or deallocated, or PROT_EXEC should be changed on allocated pages . But I am not sure I've got enough details yet on the suggested approach here and might be missing important details. I guess maybe we could discuss the details separately with Gabriel first. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 12:39 ` Paul Gofman @ 2020-05-31 16:49 ` Matthew Wilcox 2020-05-31 17:10 ` Paul Gofman 2020-06-01 17:53 ` Gabriel Krisman Bertazi 0 siblings, 2 replies; 9+ messages in thread From: Matthew Wilcox @ 2020-05-31 16:49 UTC (permalink / raw) To: Paul Gofman Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote: > > Paul (cc'ed) is the wine expert, but my understanding is that memory > > allocation and initial program load of the emulated binary will go > > through wine. It does the allocation and mark the vma accordingly > > before returning the allocated range to the windows application. > Yes, exactly. Pretty much any memory allocation which Wine does needs > syscalls (if those are ever encountered later during executing code from > those areas) to be trapped by Wine and passed to Wine's implementation > of the corresponding Windows API function. Linux native libraries > loading and memory allocations performed by them go outside of Wine control. I don't like Gabriel's approach very much. Could we do something like issue a syscall before executing a Windows region and then issue another syscall when exiting? If so, we could switch the syscall entry point (ie change MSR_LSTAR). I'm thinking something like a personality() syscall. But maybe that would be too high an overhead. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 16:49 ` Matthew Wilcox @ 2020-05-31 17:10 ` Paul Gofman 2020-05-31 17:31 ` Matthew Wilcox 2020-06-01 17:53 ` Gabriel Krisman Bertazi 1 sibling, 1 reply; 9+ messages in thread From: Paul Gofman @ 2020-05-31 17:10 UTC (permalink / raw) To: Matthew Wilcox Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura On 5/31/20 19:49, Matthew Wilcox wrote: > On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote: >>> Paul (cc'ed) is the wine expert, but my understanding is that memory >>> allocation and initial program load of the emulated binary will go >>> through wine. It does the allocation and mark the vma accordingly >>> before returning the allocated range to the windows application. >> Yes, exactly. Pretty much any memory allocation which Wine does needs >> syscalls (if those are ever encountered later during executing code from >> those areas) to be trapped by Wine and passed to Wine's implementation >> of the corresponding Windows API function. Linux native libraries >> loading and memory allocations performed by them go outside of Wine control. > I don't like Gabriel's approach very much. Could we do something like > issue a syscall before executing a Windows region and then issue another > syscall when exiting? If so, we could switch the syscall entry point (ie > change MSR_LSTAR). I'm thinking something like a personality() syscall. > But maybe that would be too high an overhead. > IIRC Gabriel had such idea that we discussed. We can potentially track the boundary between the Windows and native code exectution. But issuing syscall every time we cross that boundary may have a prohibitive performance impact, that happens way too often. What we could do is to put the flag somewhere, but that flag has to be per thread. E. g., we could use Linux gs: based thread local storage, or fs: based address (that's what Windows using for thread local data and thus Wine maintains also). If Seccomp filters could access such a memory location (fetch a byte from there and put into the structure accessible by BPF_LD) we could use SECCOMP_MODE_FILTER, I think. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 17:10 ` Paul Gofman @ 2020-05-31 17:31 ` Matthew Wilcox 2020-05-31 18:01 ` Paul Gofman 0 siblings, 1 reply; 9+ messages in thread From: Matthew Wilcox @ 2020-05-31 17:31 UTC (permalink / raw) To: Paul Gofman Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura On Sun, May 31, 2020 at 08:10:18PM +0300, Paul Gofman wrote: > On 5/31/20 19:49, Matthew Wilcox wrote: > > On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote: > >>> Paul (cc'ed) is the wine expert, but my understanding is that memory > >>> allocation and initial program load of the emulated binary will go > >>> through wine. It does the allocation and mark the vma accordingly > >>> before returning the allocated range to the windows application. > >> Yes, exactly. Pretty much any memory allocation which Wine does needs > >> syscalls (if those are ever encountered later during executing code from > >> those areas) to be trapped by Wine and passed to Wine's implementation > >> of the corresponding Windows API function. Linux native libraries > >> loading and memory allocations performed by them go outside of Wine control. > > I don't like Gabriel's approach very much. Could we do something like > > issue a syscall before executing a Windows region and then issue another > > syscall when exiting? If so, we could switch the syscall entry point (ie > > change MSR_LSTAR). I'm thinking something like a personality() syscall. > > But maybe that would be too high an overhead. > > IIRC Gabriel had such idea that we discussed. We can potentially track > the boundary between the Windows and native code exectution. But issuing > syscall every time we cross that boundary may have a prohibitive > performance impact, that happens way too often. What we could do is to > put the flag somewhere, but that flag has to be per thread. E. g., we > could use Linux gs: based thread local storage, or fs: based address > (that's what Windows using for thread local data and thus Wine maintains > also). If Seccomp filters could access such a memory location (fetch a > byte from there and put into the structure accessible by BPF_LD) we > could use SECCOMP_MODE_FILTER, I think. If it's the cost of the syscall that's the problem, there are ways around that. We'd still want a personality() call to indicate that the syscall handler should look (somewhere) to determine the current personality, but that could be issued at the start of execution rather than when we switch between Windows & Linux code. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 17:31 ` Matthew Wilcox @ 2020-05-31 18:01 ` Paul Gofman 2020-06-01 17:54 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 9+ messages in thread From: Paul Gofman @ 2020-05-31 18:01 UTC (permalink / raw) To: Matthew Wilcox Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura On 5/31/20 20:31, Matthew Wilcox wrote: > If it's the cost of the syscall that's the problem, there are ways > around that. We'd still want a personality() call to indicate that > the syscall handler should look (somewhere) to determine the current > personality, but that could be issued at the start of execution rather > than when we switch between Windows & Linux code. Sure, we can call personality() at start and specify the location to look at, the only thing is that the location should be thread specific, that is, based on fs: or gs: or whatever else which would allow us to have different threads in different "personality" state. If anything needs to be set up at thread start we can do that also of course. If there will be any proof of concept solution I will be happy to make a proof of concept Wine patch using that and do some testing. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 18:01 ` Paul Gofman @ 2020-06-01 17:54 ` Gabriel Krisman Bertazi 0 siblings, 0 replies; 9+ messages in thread From: Gabriel Krisman Bertazi @ 2020-06-01 17:54 UTC (permalink / raw) To: Paul Gofman Cc: Matthew Wilcox, Kees Cook, linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura Paul Gofman <gofmanp@gmail.com> writes: > On 5/31/20 20:31, Matthew Wilcox wrote: >> If it's the cost of the syscall that's the problem, there are ways >> around that. We'd still want a personality() call to indicate that >> the syscall handler should look (somewhere) to determine the current >> personality, but that could be issued at the start of execution rather >> than when we switch between Windows & Linux code. > > Sure, we can call personality() at start and specify the location to > look at, the only thing is that the location should be thread specific, > that is, based on fs: or gs: or whatever else which would allow us to > have different threads in different "personality" state. If anything > needs to be set up at thread start we can do that also of course. > > If there will be any proof of concept solution I will be happy to make a > proof of concept Wine patch using that and do some testing. Let me give that a try and share the patches with you, so we can look at how this implementation would look like. -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas 2020-05-31 16:49 ` Matthew Wilcox 2020-05-31 17:10 ` Paul Gofman @ 2020-06-01 17:53 ` Gabriel Krisman Bertazi 1 sibling, 0 replies; 9+ messages in thread From: Gabriel Krisman Bertazi @ 2020-06-01 17:53 UTC (permalink / raw) To: Matthew Wilcox Cc: Paul Gofman, Kees Cook, linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin, linux-security-module, Zebediah Figura Matthew Wilcox <willy@infradead.org> writes: > On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote: >> > Paul (cc'ed) is the wine expert, but my understanding is that memory >> > allocation and initial program load of the emulated binary will go >> > through wine. It does the allocation and mark the vma accordingly >> > before returning the allocated range to the windows application. >> Yes, exactly. Pretty much any memory allocation which Wine does needs >> syscalls (if those are ever encountered later during executing code from >> those areas) to be trapped by Wine and passed to Wine's implementation >> of the corresponding Windows API function. Linux native libraries >> loading and memory allocations performed by them go outside of Wine control. > > I don't like Gabriel's approach very much. Could we do something like Hi Matthew, I don't oppose your suggestion, as Paul said, it should be enough for us. But could you elaborate on the problems you see in the original approach, even if only for my own education? > issue a syscall before executing a Windows region and then issue another > syscall when exiting? If so, we could switch the syscall entry point (ie > change MSR_LSTAR). I'm thinking something like a personality() syscall. > But maybe that would be too high an overhead. > -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-06-01 17:54 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20200530055953.817666-1-krisman@collabora.com>
2020-05-30 17:30 ` [PATCH RFC] seccomp: Implement syscall isolation based on memory areas Kees Cook
2020-05-31 5:56 ` Gabriel Krisman Bertazi
2020-05-31 12:39 ` Paul Gofman
2020-05-31 16:49 ` Matthew Wilcox
2020-05-31 17:10 ` Paul Gofman
2020-05-31 17:31 ` Matthew Wilcox
2020-05-31 18:01 ` Paul Gofman
2020-06-01 17:54 ` Gabriel Krisman Bertazi
2020-06-01 17:53 ` Gabriel Krisman Bertazi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).