* [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) @ 2016-07-19 9:32 Sargun Dhillon 2016-07-19 11:17 ` Daniel Borkmann 0 siblings, 1 reply; 12+ messages in thread From: Sargun Dhillon @ 2016-07-19 9:32 UTC (permalink / raw) To: linux-kernel, netdev; +Cc: alexei.starovoitov, daniel This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. This call shouldn't sleep on any architectures based on review. It was tested with the tracex7 program on x86-64. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> --- include/uapi/linux/bpf.h | 11 +++++++++++ kernel/trace/bpf_trace.c | 31 +++++++++++++++++++++++++++++++ samples/bpf/bpf_helpers.h | 2 ++ 3 files changed, 44 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c4d9224..d435f7c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -364,6 +364,17 @@ enum bpf_func_id { */ BPF_FUNC_get_current_task, + /** + * copy_to_user(to, from, len) - Copy a block of data into user space. + * @to: destination address in userspace + * @from: source address on stack + * @len: number of bytes to copy + * Return: + * Returns number of bytes that could not be copied. + * On success, this will be zero + */ + BPF_FUNC_copy_to_user, + __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ebfbb7d..45878f3 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -81,6 +81,35 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + void *to = (void *) (long) r1; + void *from = (void *) (long) r2; + int size = (int) r3; + struct task_struct *task = current; + + /* check if we're in a user context */ + if (unlikely(in_interrupt())) + return -EINVAL; + if (unlikely(!task || !task->pid)) + return -EINVAL; + + /* Is this a user address, or a kernel address? */ + if (!access_ok(VERIFY_WRITE, to, size)) + return -EINVAL; + + return probe_kernel_write(to, from, size); +} + +static const struct bpf_func_proto bpf_copy_to_user_proto = { + .func = bpf_copy_to_user, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_PTR_TO_RAW_STACK, + .arg3_type = ARG_CONST_STACK_SIZE, +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed @@ -360,6 +389,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) return &bpf_get_smp_processor_id_proto; case BPF_FUNC_perf_event_read: return &bpf_perf_event_read_proto; + case BPF_FUNC_copy_to_user: + return &bpf_copy_to_user_proto; default: return NULL; } diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h index 84e3fd9..a54a26c 100644 --- a/samples/bpf/bpf_helpers.h +++ b/samples/bpf/bpf_helpers.h @@ -41,6 +41,8 @@ static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void *data, (void *) BPF_FUNC_perf_event_output; static int (*bpf_get_stackid)(void *ctx, void *map, int flags) = (void *) BPF_FUNC_get_stackid; +static int (*bpf_copy_to_user)(void *to, void *from, int size) = + (void *) BPF_FUNC_copy_to_user; /* llvm builtin functions that eBPF C program may use to * emit BPF_LD_ABS and BPF_LD_IND instructions -- 2.7.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-19 9:32 [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) Sargun Dhillon @ 2016-07-19 11:17 ` Daniel Borkmann 2016-07-19 16:34 ` Alexei Starovoitov 2016-07-19 17:13 ` Sargun Dhillon 0 siblings, 2 replies; 12+ messages in thread From: Daniel Borkmann @ 2016-07-19 11:17 UTC (permalink / raw) To: Sargun Dhillon; +Cc: linux-kernel, netdev, alexei.starovoitov Hi Sargun, On 07/19/2016 11:32 AM, Sargun Dhillon wrote: > This allows user memory to be written to during the course of a kprobe. > It shouldn't be used to implement any kind of security mechanism > because of TOC-TOU attacks, but rather to debug, divert, and > manipulate execution of semi-cooperative processes. > > Although it uses probe_kernel_write, we limit the address space > the probe can write into by checking the space with access_ok. > This call shouldn't sleep on any architectures based on review. > > It was tested with the tracex7 program on x86-64. > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > Cc: Alexei Starovoitov <ast@kernel.org> [...] > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index ebfbb7d..45878f3 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -81,6 +81,35 @@ static const struct bpf_func_proto bpf_probe_read_proto = { > .arg3_type = ARG_ANYTHING, > }; > > +static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) > +{ > + void *to = (void *) (long) r1; > + void *from = (void *) (long) r2; > + int size = (int) r3; Nit: you have two spaces here? > + struct task_struct *task = current; > + > + /* check if we're in a user context */ > + if (unlikely(in_interrupt())) > + return -EINVAL; > + if (unlikely(!task || !task->pid)) Why !task->pid is needed? Can you point to the code where this is set up as such? I'd assume if that's the case, there's a generic helper for it to determine that the task_struct is a kernel task? > + return -EINVAL; > + > + /* Is this a user address, or a kernel address? */ > + if (!access_ok(VERIFY_WRITE, to, size)) > + return -EINVAL; > + > + return probe_kernel_write(to, from, size); I'm still worried that this can lead to all kind of hard to find bugs or races for user processes, if you make this writable to entire user address space (which is the only thing that access_ok() checks for). What if the BPF program has bugs and writes to wrong addresses for example, introducing bugs in some other, non-intended processes elsewhere? Can this be limited to syscalls only? And if so, to the passed memory only? Have you played around with ptrace() to check whether you could achieve similar functionality (was thinking about things like [1], PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't this be limited to a similar functionality for only the current task. ptrace() utilizes helpers like access_process_vm(), maybe this can similarly be adapted here, too (under the circumstances that sleeping is not allowed)? [1] https://code.google.com/archive/p/ptrace-parasite/ > +} > + > +static const struct bpf_func_proto bpf_copy_to_user_proto = { > + .func = bpf_copy_to_user, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_ANYTHING, > + .arg2_type = ARG_PTR_TO_RAW_STACK, > + .arg3_type = ARG_CONST_STACK_SIZE, Nit: please tab-align '=' like we have in the other *_proto cases in bpf_trace. > +}; > + > /* > * limited trace_printk() > * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed Thanks, Daniel ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-19 11:17 ` Daniel Borkmann @ 2016-07-19 16:34 ` Alexei Starovoitov 2016-07-19 23:19 ` Daniel Borkmann 2016-07-19 17:13 ` Sargun Dhillon 1 sibling, 1 reply; 12+ messages in thread From: Alexei Starovoitov @ 2016-07-19 16:34 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Sargun Dhillon, linux-kernel, netdev On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: > >+ return -EINVAL; > >+ > >+ /* Is this a user address, or a kernel address? */ > >+ if (!access_ok(VERIFY_WRITE, to, size)) > >+ return -EINVAL; > >+ > >+ return probe_kernel_write(to, from, size); > > I'm still worried that this can lead to all kind of hard to find > bugs or races for user processes, if you make this writable to entire > user address space (which is the only thing that access_ok() checks > for). What if the BPF program has bugs and writes to wrong addresses > for example, introducing bugs in some other, non-intended processes > elsewhere? Can this be limited to syscalls only? And if so, to the > passed memory only? my understanding that above code will write only to memory of current process, so impact is contained and in that sense buggy kprobe program is no different from buggy seccomp prorgram. Limiting this to syscalls will make it too limited. I'm in favor of this change, because it allows us to experiment with restartable sequences and lock-free algorithms that need ultrafast access to cpuid without burdening the kernel with stable abi. > Have you played around with ptrace() to check whether you could > achieve similar functionality (was thinking about things like [1], > PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't > this be limited to a similar functionality for only the current task. > ptrace() utilizes helpers like access_process_vm(), maybe this can > similarly be adapted here, too (under the circumstances that sleeping > is not allowed)? If we hack access_process_vm I think at the end it will look like probe_kernel_write. Walking mm requires semaphore, so we would only be able to do it in task_work and there we can do normal copy_to_user just as well, but it will complicate the programs quite a bit, since writes will be asynchronous and batched. Looks like with access_ok+probe_write we can achieve the same with a lot less code. As far as races between user and bpf program, yeah, if process is multithreaded, the other threads may access the same mem that bpf is writing to, but that's no different from reading. For tracing we walk complex datastructures and pointers. They can be changed by user space on the fly and bpf will see garbage. Like we have uprobe+bpf that walks clang c++ internal datastructures to figure out how long it takes clang to process .h headers. There is a lot of fragility in the bpf script, yet it's pretty useful to quickly analyze compile times. I see bpf_copy_to_user to be hugely valuable too, not as a stable interface, but rather as a tool to quickly debug and experiment. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-19 16:34 ` Alexei Starovoitov @ 2016-07-19 23:19 ` Daniel Borkmann 2016-07-20 3:02 ` Alexei Starovoitov 2016-07-20 3:08 ` Sargun Dhillon 0 siblings, 2 replies; 12+ messages in thread From: Daniel Borkmann @ 2016-07-19 23:19 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Sargun Dhillon, linux-kernel, netdev On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: > On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: >>> + return -EINVAL; >>> + >>> + /* Is this a user address, or a kernel address? */ >>> + if (!access_ok(VERIFY_WRITE, to, size)) >>> + return -EINVAL; >>> + >>> + return probe_kernel_write(to, from, size); >> >> I'm still worried that this can lead to all kind of hard to find >> bugs or races for user processes, if you make this writable to entire >> user address space (which is the only thing that access_ok() checks >> for). What if the BPF program has bugs and writes to wrong addresses >> for example, introducing bugs in some other, non-intended processes >> elsewhere? Can this be limited to syscalls only? And if so, to the >> passed memory only? > > my understanding that above code will write only to memory of current process, > so impact is contained and in that sense buggy kprobe program is no different > from buggy seccomp prorgram. Compared to seccomp, you might not notice that a race has happened, in seccomp case you might have killed your process, which is visible. But ok, in ptrace() case it might be similar issue perhaps ... The asm-generic version does __access_ok(..) { return 1; } for nommu case, I haven't checked closely enough whether there's actually an arch that uses this, but for example arm nommu with only one addr space would certainly result in access_ok() as 1, and then you could also go into probe_kernel_write(), no? Don't know that code well enough, but I believe the check would only ensure in normal use-cases that user process doesn't fiddle with kernel address space, but not necessarily guarantee that this really only belongs to the process address space. x86 code comments this with "note that, depending on architecture, this function probably just checks that the pointer is in the user space range - after calling this function, memory access functions may still return -EFAULT". Also, what happens in case of kernel thread? As it stands, it does ... if (unlikely(in_interrupt())) return -EINVAL; if (unlikely(!task || !task->pid)) return -EINVAL; So up to here, irq/sirq, NULL current and that current is not the 'idle' process is being checked (still fail to see the point for the !task->pid, I believe the intend here is different). /* Is this a user address, or a kernel address? */ if (!access_ok(VERIFY_WRITE, to, size)) return -EINVAL; Now here. What if it's a kernel thread? You'll have KERNEL_DS segment, task->pid was non-zero as well for the kthread, so access_ok() will pass and you can still execute probe_kernel_write() ... > Limiting this to syscalls will make it too limited. > I'm in favor of this change, because it allows us to experiment > with restartable sequences and lock-free algorithms that need ultrafast > access to cpuid without burdening the kernel with stable abi. > >> Have you played around with ptrace() to check whether you could >> achieve similar functionality (was thinking about things like [1], >> PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't >> this be limited to a similar functionality for only the current task. >> ptrace() utilizes helpers like access_process_vm(), maybe this can >> similarly be adapted here, too (under the circumstances that sleeping >> is not allowed)? > > If we hack access_process_vm I think at the end it will look like > probe_kernel_write. Walking mm requires semaphore, so we would only > be able to do it in task_work and there we can do normal copy_to_user > just as well, but it will complicate the programs quite a bit, since > writes will be asynchronous and batched. > Looks like with access_ok+probe_write we can achieve the same > with a lot less code. I believe it may not quite be the same as it currently stands. No fundamental objection, just that this needs to be made "safe" to the limits you state above yourself. ;) > As far as races between user and bpf program, yeah, if process > is multithreaded, the other threads may access the same mem that > bpf is writing to, but that's no different from reading. > For tracing we walk complex datastructures and pointers. They > can be changed by user space on the fly and bpf will see garbage. > Like we have uprobe+bpf that walks clang c++ internal datastructures > to figure out how long it takes clang to process .h headers. > There is a lot of fragility in the bpf script, yet it's pretty > useful to quickly analyze compile times. > I see bpf_copy_to_user to be hugely valuable too, not as a stable > interface, but rather as a tool to quickly debug and experiment. Right, so maybe there should be a warn once to the dmesg log that this is just experimental? Thanks, Daniel ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-19 23:19 ` Daniel Borkmann @ 2016-07-20 3:02 ` Alexei Starovoitov 2016-07-20 9:58 ` Sargun Dhillon 2016-07-20 10:05 ` Daniel Borkmann 2016-07-20 3:08 ` Sargun Dhillon 1 sibling, 2 replies; 12+ messages in thread From: Alexei Starovoitov @ 2016-07-20 3:02 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Sargun Dhillon, linux-kernel, netdev On Wed, Jul 20, 2016 at 01:19:51AM +0200, Daniel Borkmann wrote: > On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: > >On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: > >>>+ return -EINVAL; > >>>+ > >>>+ /* Is this a user address, or a kernel address? */ > >>>+ if (!access_ok(VERIFY_WRITE, to, size)) > >>>+ return -EINVAL; > >>>+ > >>>+ return probe_kernel_write(to, from, size); > >> > >>I'm still worried that this can lead to all kind of hard to find > >>bugs or races for user processes, if you make this writable to entire > >>user address space (which is the only thing that access_ok() checks > >>for). What if the BPF program has bugs and writes to wrong addresses > >>for example, introducing bugs in some other, non-intended processes > >>elsewhere? Can this be limited to syscalls only? And if so, to the > >>passed memory only? > > > >my understanding that above code will write only to memory of current process, > >so impact is contained and in that sense buggy kprobe program is no different > >from buggy seccomp prorgram. > > Compared to seccomp, you might not notice that a race has happened, > in seccomp case you might have killed your process, which is visible. > But ok, in ptrace() case it might be similar issue perhaps ... > > The asm-generic version does __access_ok(..) { return 1; } for nommu > case, I haven't checked closely enough whether there's actually an arch > that uses this, but for example arm nommu with only one addr space would > certainly result in access_ok() as 1, and then you could also go into > probe_kernel_write(), no? good point. how arm nommu handles copy_to_user? if there is nommu then there is no user/kernel mm ? Crazy archs. I guess we have to disable this helper on all such archs. > Don't know that code well enough, but I believe the check would only > ensure in normal use-cases that user process doesn't fiddle with kernel > address space, but not necessarily guarantee that this really only > belongs to the process address space. why? on x86 that exactly what it does. access_ok=true means it's user space address and since we're in _this user context_ probe_kernel_write can only affect this user. > x86 code comments this with "note that, depending on architecture, > this function probably just checks that the pointer is in the user > space range - after calling this function, memory access functions may > still return -EFAULT". Yes. I've read that comment to :) Certainly not an expert, but the archs I've looked at, access_ok has the same meaning as on x86. They check the space range to make sure address doesn't belong to kernel. Could I have missed something? Certainly. Please double check :) > Also, what happens in case of kernel thread? my understanding if access_ok(addr)=true the addr will never point to memory of kernel thread. We need expert opinion. Whom should we ping? > As it stands, it does ... > > if (unlikely(in_interrupt())) > return -EINVAL; > if (unlikely(!task || !task->pid)) > return -EINVAL; > > So up to here, irq/sirq, NULL current and that current is not the 'idle' > process is being checked (still fail to see the point for the !task->pid, > I believe the intend here is different). > > /* Is this a user address, or a kernel address? */ > if (!access_ok(VERIFY_WRITE, to, size)) > return -EINVAL; > > Now here. What if it's a kernel thread? You'll have KERNEL_DS segment, > task->pid was non-zero as well for the kthread, so access_ok() will > pass and you can still execute probe_kernel_write() ... I think user_addr_max() should be zero for kthread, but worth checking for sure. > >Limiting this to syscalls will make it too limited. > >I'm in favor of this change, because it allows us to experiment > >with restartable sequences and lock-free algorithms that need ultrafast > >access to cpuid without burdening the kernel with stable abi. > > > >>Have you played around with ptrace() to check whether you could > >>achieve similar functionality (was thinking about things like [1], > >>PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't > >>this be limited to a similar functionality for only the current task. > >>ptrace() utilizes helpers like access_process_vm(), maybe this can > >>similarly be adapted here, too (under the circumstances that sleeping > >>is not allowed)? > > > >If we hack access_process_vm I think at the end it will look like > >probe_kernel_write. Walking mm requires semaphore, so we would only > >be able to do it in task_work and there we can do normal copy_to_user > >just as well, but it will complicate the programs quite a bit, since > >writes will be asynchronous and batched. > >Looks like with access_ok+probe_write we can achieve the same > >with a lot less code. > > I believe it may not quite be the same as it currently stands. No > fundamental objection, just that this needs to be made "safe" to the > limits you state above yourself. ;) completely agree :) the more eyes on the patch the better. > >As far as races between user and bpf program, yeah, if process > >is multithreaded, the other threads may access the same mem that > >bpf is writing to, but that's no different from reading. > >For tracing we walk complex datastructures and pointers. They > >can be changed by user space on the fly and bpf will see garbage. > >Like we have uprobe+bpf that walks clang c++ internal datastructures > >to figure out how long it takes clang to process .h headers. > >There is a lot of fragility in the bpf script, yet it's pretty > >useful to quickly analyze compile times. > >I see bpf_copy_to_user to be hugely valuable too, not as a stable > >interface, but rather as a tool to quickly debug and experiment. > > Right, so maybe there should be a warn once to the dmesg log that this > is just experimental? good point. The tracing experience showed that trace_printk warning was quite effective on steering user space assumptions. Let's do something here as well. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-20 3:02 ` Alexei Starovoitov @ 2016-07-20 9:58 ` Sargun Dhillon 2016-07-20 23:00 ` Daniel Borkmann 2016-07-20 10:05 ` Daniel Borkmann 1 sibling, 1 reply; 12+ messages in thread From: Sargun Dhillon @ 2016-07-20 9:58 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Daniel Borkmann, linux-kernel, netdev On Tue, 19 Jul 2016, Alexei Starovoitov wrote: > On Wed, Jul 20, 2016 at 01:19:51AM +0200, Daniel Borkmann wrote: >> On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: >>> On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: >>>>> + return -EINVAL; >>>>> + >>>>> + /* Is this a user address, or a kernel address? */ >>>>> + if (!access_ok(VERIFY_WRITE, to, size)) >>>>> + return -EINVAL; >>>>> + >>>>> + return probe_kernel_write(to, from, size); >>>> >>>> I'm still worried that this can lead to all kind of hard to find >>>> bugs or races for user processes, if you make this writable to entire >>>> user address space (which is the only thing that access_ok() checks >>>> for). What if the BPF program has bugs and writes to wrong addresses >>>> for example, introducing bugs in some other, non-intended processes >>>> elsewhere? Can this be limited to syscalls only? And if so, to the >>>> passed memory only? >>> >>> my understanding that above code will write only to memory of current process, >>> so impact is contained and in that sense buggy kprobe program is no different >>> from buggy seccomp prorgram. >> >> Compared to seccomp, you might not notice that a race has happened, >> in seccomp case you might have killed your process, which is visible. >> But ok, in ptrace() case it might be similar issue perhaps ... >> >> The asm-generic version does __access_ok(..) { return 1; } for nommu >> case, I haven't checked closely enough whether there's actually an arch >> that uses this, but for example arm nommu with only one addr space would >> certainly result in access_ok() as 1, and then you could also go into >> probe_kernel_write(), no? > > good point. how arm nommu handles copy_to_user? if there is nommu > then there is no user/kernel mm ? Crazy archs. > I guess we have to disable this helper on all such archs. > How do y'all do testing / development with nommu architectures? Also, what are the guarantees of memory protection a single address space nommu architecture gives you? It seems like with some of the access_ok implementations there's little protection. It seems to me that the same risk that's in copy_to_user for user provided pointers in those architectures exists in this helper (nil). So, I agree, that we should probably just disable this helper. >> Don't know that code well enough, but I believe the check would only >> ensure in normal use-cases that user process doesn't fiddle with kernel >> address space, but not necessarily guarantee that this really only >> belongs to the process address space. > > why? on x86 that exactly what it does. access_ok=true means > it's user space address and since we're in _this user context_ > probe_kernel_write can only affect this user. > I actually added a second check to ensure that the user context we're in is the active user context. This prevents this helper from effecting other processes during things like a process_vm_writev. >> x86 code comments this with "note that, depending on architecture, >> this function probably just checks that the pointer is in the user >> space range - after calling this function, memory access functions may >> still return -EFAULT". > > Yes. I've read that comment to :) > Certainly not an expert, but the archs I've looked at, access_ok > has the same meaning as on x86. They check the space range to > make sure address doesn't belong to kernel. > Could I have missed something? Certainly. Please double check :) > >> Also, what happens in case of kernel thread? > > my understanding if access_ok(addr)=true the addr will never point > to memory of kernel thread. > We need expert opinion. Whom should we ping? > >> As it stands, it does ... >> >> if (unlikely(in_interrupt())) >> return -EINVAL; >> if (unlikely(!task || !task->pid)) >> return -EINVAL; >> >> So up to here, irq/sirq, NULL current and that current is not the 'idle' >> process is being checked (still fail to see the point for the !task->pid, >> I believe the intend here is different). >> >> /* Is this a user address, or a kernel address? */ >> if (!access_ok(VERIFY_WRITE, to, size)) >> return -EINVAL; >> >> Now here. What if it's a kernel thread? You'll have KERNEL_DS segment, >> task->pid was non-zero as well for the kthread, so access_ok() will >> pass and you can still execute probe_kernel_write() ... > > I think user_addr_max() should be zero for kthread, but > worth checking for sure. > >>> Limiting this to syscalls will make it too limited. >>> I'm in favor of this change, because it allows us to experiment >>> with restartable sequences and lock-free algorithms that need ultrafast >>> access to cpuid without burdening the kernel with stable abi. >>> >>>> Have you played around with ptrace() to check whether you could >>>> achieve similar functionality (was thinking about things like [1], >>>> PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't >>>> this be limited to a similar functionality for only the current task. >>>> ptrace() utilizes helpers like access_process_vm(), maybe this can >>>> similarly be adapted here, too (under the circumstances that sleeping >>>> is not allowed)? >>> >>> If we hack access_process_vm I think at the end it will look like >>> probe_kernel_write. Walking mm requires semaphore, so we would only >>> be able to do it in task_work and there we can do normal copy_to_user >>> just as well, but it will complicate the programs quite a bit, since >>> writes will be asynchronous and batched. >>> Looks like with access_ok+probe_write we can achieve the same >>> with a lot less code. >> >> I believe it may not quite be the same as it currently stands. No >> fundamental objection, just that this needs to be made "safe" to the >> limits you state above yourself. ;) > > completely agree :) the more eyes on the patch the better. > >>> As far as races between user and bpf program, yeah, if process >>> is multithreaded, the other threads may access the same mem that >>> bpf is writing to, but that's no different from reading. >>> For tracing we walk complex datastructures and pointers. They >>> can be changed by user space on the fly and bpf will see garbage. >>> Like we have uprobe+bpf that walks clang c++ internal datastructures >>> to figure out how long it takes clang to process .h headers. >>> There is a lot of fragility in the bpf script, yet it's pretty >>> useful to quickly analyze compile times. >>> I see bpf_copy_to_user to be hugely valuable too, not as a stable >>> interface, but rather as a tool to quickly debug and experiment. >> >> Right, so maybe there should be a warn once to the dmesg log that this >> is just experimental? > > good point. The tracing experience showed that trace_printk warning > was quite effective on steering user space assumptions. Let's do > something here as well. > > So, with that, what about the following: It includes -Desupporting no MMU platforms as we've deemed them incapable of being safe -Checking that we're not in a kthread -Checking that the active mm is the thread's mm -A log message indicating the experimental nature of this helper It does not include: -A heuristic to determine is access_ok is broken, or if the platform didn't implement it. It seems all platforms with MMUs implement it today, and it seems clear to make that platforms should do something better than return 1, if they can commit 04951b1caf63e06024350389cbddbcccf1a4674b Author: Sargun Dhillon <sargun@sargun.me> Date: Mon Jul 18 19:38:58 2016 -0700 bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. This call shouldn't sleep on any architectures based on review. It was tested with the tracex7 program on x86-64. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c4d9224..d435f7c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -364,6 +364,17 @@ enum bpf_func_id { */ BPF_FUNC_get_current_task, + /** + * copy_to_user(to, from, len) - Copy a block of data into user space. + * @to: destination address in userspace + * @from: source address on stack + * @len: number of bytes to copy + * Return: + * Returns number of bytes that could not be copied. + * On success, this will be zero + */ + BPF_FUNC_copy_to_user, + __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ebfbb7d..e5f17b0 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -81,6 +81,68 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static void bpf_copy_to_user_warn(void) { + pr_warn_once("**************************************************\n"); + pr_warn_once("* bpf_copy_to_user: Experimental Feature in use *\n"); + pr_warn_once("* bpf_copy_to_user: Feature may corrupt memory *\n"); + pr_warn_once("**************************************************\n"); +} + +static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ +/* + * There isn't enough protection on no MMU architectures for this helper + * to be considered safe + */ +#ifdef CONFIG_MMU + void *to = (void *) (long) r1; + void *from = (void *) (long) r2; + int size = (int) r3; + struct task_struct *task = current; + + /* + * Ensure we're in a user context which it is safe for the helper + * to run. + * Also, this helper has no business in a kthread. + * Although, access_ok should prevent writing to non-user memory + * On some architectures (nommu, etc...) access_ok isn't enough + */ + + if (unlikely(in_interrupt() || !task)) + return -EINVAL; + + if (unlikely(!task->pid || !task->mm || task->flags & PF_KTHREAD)) + return -EINVAL; + + /* + * Make sure that this thread hasn't adopted another process's + * memory context a la process_vm_(read/write)v to avoid unintended + * side effects + */ + if (unlikely(task->mm != task->active_mm)) + return -EINVAL; + + /* Is this a user address, or a kernel address? */ + if (!access_ok(VERIFY_WRITE, to, size)) + return -EINVAL; + + DO_ONCE(bpf_copy_to_user_warn); + + return probe_kernel_write(to, from, size); +#else /* CONFIG_MMU */ + return -ENOTSUP; +#endif /* CONFIG_MMU */ +} + +static const struct bpf_func_proto bpf_copy_to_user_proto = { + .func = bpf_copy_to_user, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_PTR_TO_RAW_STACK, + .arg3_type = ARG_CONST_STACK_SIZE, +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed @@ -360,6 +422,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) return &bpf_get_smp_processor_id_proto; case BPF_FUNC_perf_event_read: return &bpf_perf_event_read_proto; + case BPF_FUNC_copy_to_user: + return &bpf_copy_to_user_proto; default: return NULL; } diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h index 84e3fd9..a54a26c 100644 --- a/samples/bpf/bpf_helpers.h +++ b/samples/bpf/bpf_helpers.h @@ -41,6 +41,8 @@ static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void *data, (void *) BPF_FUNC_perf_event_output; static int (*bpf_get_stackid)(void *ctx, void *map, int flags) = (void *) BPF_FUNC_get_stackid; +static int (*bpf_copy_to_user)(void *to, void *from, int size) = + (void *) BPF_FUNC_copy_to_user; /* llvm builtin functions that eBPF C program may use to * emit BPF_LD_ABS and BPF_LD_IND instructions ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-20 9:58 ` Sargun Dhillon @ 2016-07-20 23:00 ` Daniel Borkmann 2016-07-21 10:47 ` Sargun Dhillon 0 siblings, 1 reply; 12+ messages in thread From: Daniel Borkmann @ 2016-07-20 23:00 UTC (permalink / raw) To: Sargun Dhillon, Alexei Starovoitov; +Cc: linux-kernel, netdev On 07/20/2016 11:58 AM, Sargun Dhillon wrote: [...] > So, with that, what about the following: > It includes > -Desupporting no MMU platforms as we've deemed them incapable of being > safe > -Checking that we're not in a kthread > -Checking that the active mm is the thread's mm > -A log message indicating the experimental nature of this helper > > It does not include: > -A heuristic to determine is access_ok is broken, or if the platform > didn't implement it. It seems all platforms with MMUs implement it today, > and it seems clear to make that platforms should do something better than > return 1, if they can I don't really like couple of things, your ifdef CONFIG_MMU might not be needed I think, couple of these checks seem redundant, (I'm not yet sure about the task->mm != task->active_mm thingy), the helper should definitely be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be a bit analogue to bpf_probe_read we have. How about something roughly along the lines of below diff (obviously needs extensive testing ...)? This can still do all kind of ugly crap to the user process, but limited to the cap_sys_admin to shoot himself in the foot. include/uapi/linux/bpf.h | 3 +++ kernel/trace/bpf_trace.c | 30 ++++++++++++++++++++++++++++++ 2 files changed, 33 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 2b7076f..4d339c6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -365,6 +365,9 @@ enum bpf_func_id { */ BPF_FUNC_get_current_task, + /* Doc goes here ... */ + BPF_FUNC_probe_write, + __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index a12bbd3..43a4386c 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -81,6 +81,34 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct task_struct *task = current; + void *unsafe_ptr = (void *)(long) r1; + void *src = (void *)(long) r2; + int size = (int) r3; + + if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD))) + return -EPERM; + if (segment_eq(get_fs(), KERNEL_DS)) + return -EPERM; + if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) + return -EPERM; + + /* pr_warn_once() barks here ... */ + + return probe_kernel_write(unsafe_ptr, src, size); +} + +static const struct bpf_func_proto bpf_probe_write_proto = { + .func = bpf_probe_write, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_PTR_TO_STACK, + .arg3_type = ARG_CONST_STACK_SIZE, +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed @@ -344,6 +372,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) return &bpf_map_delete_elem_proto; case BPF_FUNC_probe_read: return &bpf_probe_read_proto; + case BPF_FUNC_probe_write: + return &bpf_probe_write_proto; case BPF_FUNC_ktime_get_ns: return &bpf_ktime_get_ns_proto; case BPF_FUNC_tail_call: -- 1.9.3 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-20 23:00 ` Daniel Borkmann @ 2016-07-21 10:47 ` Sargun Dhillon 2016-07-21 11:30 ` Daniel Borkmann 0 siblings, 1 reply; 12+ messages in thread From: Sargun Dhillon @ 2016-07-21 10:47 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Alexei Starovoitov, linux-kernel, netdev On Thu, Jul 21, 2016 at 01:00:51AM +0200, Daniel Borkmann wrote: > On 07/20/2016 11:58 AM, Sargun Dhillon wrote: > [...] > >So, with that, what about the following: > >It includes > >-Desupporting no MMU platforms as we've deemed them incapable of being > > safe > >-Checking that we're not in a kthread > >-Checking that the active mm is the thread's mm > >-A log message indicating the experimental nature of this helper > > > >It does not include: > >-A heuristic to determine is access_ok is broken, or if the platform > > didn't implement it. It seems all platforms with MMUs implement it today, > > and it seems clear to make that platforms should do something better than > > return 1, if they can > > I don't really like couple of things, your ifdef CONFIG_MMU might not be > needed I think, couple of these checks seem redundant, (I'm not yet sure > about the task->mm != task->active_mm thingy), the helper should definitely > be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be > a bit analogue to bpf_probe_read we have. How about something roughly along > the lines of below diff (obviously needs extensive testing ...)? This > can still do all kind of ugly crap to the user process, but limited to > the cap_sys_admin to shoot himself in the foot. * You're right about CONFIG_MMU. We don't need it, all of the nommu platforms properly deal with it from my research. It was always ARG_PTR_TO_STACK? Or are you saying ARG_PTR_TO_STACK is buggy and we should make it ARG_PTR_TO_RAW_STACK? I originally named the function bpf_probe_write. Upon further thought I don't think that makes sense. The reason is because bpf_probe_read is analogous to probe_kernel_read. If we had bpf_probe_write, I think people might reason it to be equivalent to probe_kernel_write, and be confused when they can't write to kernel space. I tried to make the external facing documentaion close to copy_to_user. That's how people should use it, not like _write. Therefor I think it makes sense to keep that the name. I added a check for (!task) -- It seems to be spattered throughou the eBPF helper code. Alexei mentioned earlier that it can be null, but I'm not sure of the case RE: task->mm != task->active_mm -- There are a couple scenarios where kthreads do this, and the only user call that should hit this is execve. There's only a brief period where this can be true and I don't think it's worth dealing with that case -- I'm not really sure you can plant a kprobe at the right site either Did some minimal testing with tracex7 and others. I was able to copy memory into other process's space while certain syscalls were going on. I don't think that there are a reasonable set of protections. I'll do more testing. Any suggestions of what might break? I've looked at writing to unitialized memory, Memory out of bounds, etc... Do you know of any "weak spots"? The rest of the helpers return EINVAL on invalid state. I think it makes sense to match them. ENOPERM makes sense for interacting with an illegal chunk of memory, but ENOPERM just because you're in a kthread seems especially weird. > > include/uapi/linux/bpf.h | 3 +++ > kernel/trace/bpf_trace.c | 30 ++++++++++++++++++++++++++++++ > 2 files changed, 33 insertions(+) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 2b7076f..4d339c6 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -365,6 +365,9 @@ enum bpf_func_id { > */ > BPF_FUNC_get_current_task, > > + /* Doc goes here ... */ > + BPF_FUNC_probe_write, > + > __BPF_FUNC_MAX_ID, > }; > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index a12bbd3..43a4386c 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -81,6 +81,34 @@ static const struct bpf_func_proto bpf_probe_read_proto = { > .arg3_type = ARG_ANYTHING, > }; > > +static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) > +{ > + struct task_struct *task = current; > + void *unsafe_ptr = (void *)(long) r1; > + void *src = (void *)(long) r2; > + int size = (int) r3; > + > + if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD))) > + return -EPERM; > + if (segment_eq(get_fs(), KERNEL_DS)) > + return -EPERM; > + if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) > + return -EPERM; > + > + /* pr_warn_once() barks here ... */ > + > + return probe_kernel_write(unsafe_ptr, src, size); > +} > + > +static const struct bpf_func_proto bpf_probe_write_proto = { > + .func = bpf_probe_write, > + .gpl_only = true, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_ANYTHING, > + .arg2_type = ARG_PTR_TO_STACK, > + .arg3_type = ARG_CONST_STACK_SIZE, > +}; > + > /* > * limited trace_printk() > * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed > @@ -344,6 +372,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) > return &bpf_map_delete_elem_proto; > case BPF_FUNC_probe_read: > return &bpf_probe_read_proto; > + case BPF_FUNC_probe_write: > + return &bpf_probe_write_proto; > case BPF_FUNC_ktime_get_ns: > return &bpf_ktime_get_ns_proto; > case BPF_FUNC_tail_call: > -- > 1.9.3 commit e95732cb75e256af621ecaf85fd48d51bf91da1b Author: Sargun Dhillon <sargun@sargun.me> Date: Mon Jul 18 19:38:58 2016 -0700 bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. This call shouldn't sleep on any architectures based on review. It was tested with the tracex7 program on x86-64. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c4d9224..d435f7c 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -364,6 +364,17 @@ enum bpf_func_id { */ BPF_FUNC_get_current_task, + /** + * copy_to_user(to, from, len) - Copy a block of data into user space. + * @to: destination address in userspace + * @from: source address on stack + * @len: number of bytes to copy + * Return: + * Returns number of bytes that could not be copied. + * On success, this will be zero + */ + BPF_FUNC_copy_to_user, + __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ebfbb7d..b82b9e9 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -81,6 +81,48 @@ static const struct bpf_func_proto bpf_probe_read_proto = { .arg3_type = ARG_ANYTHING, }; +static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + void *unsafe_ptr = (void *) (long) r1; + void *src = (void *) (long) r2; + int size = (int) r3; + struct task_struct *task = current; + + /* + * Ensure we're in a user context which it is safe for the helper + * to run. + * Also, this helper has no business in a kthread. + * Although, access_ok should prevent writing to non-user memory + * On some architectures (nommu, etc...) access_ok isn't enough + * So we check get_fs() + */ + + if (unlikely(!task)) + return -EINVAL; + if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD))) + return -EINVAL; + if (unlikely(segment_eq(get_fs(), KERNEL_DS))) + return -EINVAL; + if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) + return -EPERM; + + pr_warn_once("**************************************************\n"); + pr_warn_once("* bpf_copy_to_user: Experimental Feature in use *\n"); + pr_warn_once("* bpf_copy_to_user: Feature may corrupt memory *\n"); + pr_warn_once("**************************************************\n"); + + return probe_kernel_write(unsafe_ptr, src, size); +} + +static const struct bpf_func_proto bpf_copy_to_user_proto = { + .func = bpf_copy_to_user, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_PTR_TO_STACK, + .arg3_type = ARG_CONST_STACK_SIZE, +}; + /* * limited trace_printk() * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers allowed @@ -360,6 +402,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id) return &bpf_get_smp_processor_id_proto; case BPF_FUNC_perf_event_read: return &bpf_perf_event_read_proto; + case BPF_FUNC_copy_to_user: + return &bpf_copy_to_user_proto; default: return NULL; } diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h index 84e3fd9..a54a26c 100644 --- a/samples/bpf/bpf_helpers.h +++ b/samples/bpf/bpf_helpers.h @@ -41,6 +41,8 @@ static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void *data, (void *) BPF_FUNC_perf_event_output; static int (*bpf_get_stackid)(void *ctx, void *map, int flags) = (void *) BPF_FUNC_get_stackid; +static int (*bpf_copy_to_user)(void *to, void *from, int size) = + (void *) BPF_FUNC_copy_to_user; /* llvm builtin functions that eBPF C program may use to * emit BPF_LD_ABS and BPF_LD_IND instructions ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-21 10:47 ` Sargun Dhillon @ 2016-07-21 11:30 ` Daniel Borkmann 0 siblings, 0 replies; 12+ messages in thread From: Daniel Borkmann @ 2016-07-21 11:30 UTC (permalink / raw) To: Sargun Dhillon; +Cc: Alexei Starovoitov, linux-kernel, netdev On 07/21/2016 12:47 PM, Sargun Dhillon wrote: > On Thu, Jul 21, 2016 at 01:00:51AM +0200, Daniel Borkmann wrote: [...] >> I don't really like couple of things, your ifdef CONFIG_MMU might not be >> needed I think, couple of these checks seem redundant, (I'm not yet sure >> about the task->mm != task->active_mm thingy), the helper should definitely >> be gpl_only and ARG_PTR_TO_RAW_STACK is just buggy. Also, this should be >> a bit analogue to bpf_probe_read we have. How about something roughly along >> the lines of below diff (obviously needs extensive testing ...)? This >> can still do all kind of ugly crap to the user process, but limited to >> the cap_sys_admin to shoot himself in the foot. > * You're right about CONFIG_MMU. We don't need it, all of the nommu platforms > properly deal with it from my research. The segment_eq() test should generically catch this from what I see. > It was always ARG_PTR_TO_STACK? Or are you saying ARG_PTR_TO_STACK is buggy and > we should make it ARG_PTR_TO_RAW_STACK? No, in your patch, you had '+ .arg2_type = ARG_PTR_TO_RAW_STACK,', which is not correct as it means you don't need to initialize the memory you pass in for your *src pointer. I believe you took this over from probe_read(), but there it's correct. ARG_PTR_TO_STACK means the verifier checks that it's initialized with data. > I originally named the function bpf_probe_write. Upon further thought I don't > think that makes sense. The reason is because bpf_probe_read is analogous to > probe_kernel_read. If we had bpf_probe_write, I think people might reason it to > be equivalent to probe_kernel_write, and be confused when they can't write to > kernel space. I still think that bpf_probe_write is the more appropriate name, and that the -EPERM are also better than -EINVAL. For user space, you'll have the bpf_probe_read() / bpf_probe_write() pair you can use, which is the more intuitive complement, also since people might already use bpf_probe_read(), so they kind of can infer its meaning. It's just that the kernel doesn't give you _permission_ to mess with kernel memory, hence due to not being allowed -EPERM to make this absolutely clear to the user that this is illegal. -EINVAL usually means one of the function arguments might be wrong, I think -EPERM is a better / more clear fit in this case, imho. > I tried to make the external facing documentaion close to copy_to_user. That's > how people should use it, not like _write. Therefor I think it makes sense to > keep that the name. But still, you *probe* to write somewhere to the process' address space, so it can still fail with -EFAULT. Other than that, see comment above. > I added a check for (!task) -- It seems to be spattered throughou the eBPF > helper code. Alexei mentioned earlier that it can be null, but I'm not sure of > the case Well, the test of unlikely(in_interrupt() || (task->flags & PF_KTHREAD)) will cover these cases. It makes sure that you're neither in hardirq (NULL here) nor softirq and that you're not in a kthread. > RE: task->mm != task->active_mm -- There are a couple scenarios where kthreads > do this, and the only user call that should hit this is execve. There's only a > brief period where this can be true and I don't think it's worth dealing with > that case -- I'm not really sure you can plant a kprobe at the right site either But if kthreads do this, wouldn't this already be covered by task->flags & PF_KTHREAD test for such case? > Did some minimal testing with tracex7 and others. Ok, great! > I was able to copy memory into other process's space while certain > syscalls were going on. I don't think that there are a reasonable set of > protections. Right, it doesn't prevent that syscalls going on in parallel. > I'll do more testing. Any suggestions of what might break? I've looked at > writing to unitialized memory, Memory out of bounds, etc... Do you know of any > "weak spots"? Well, you could write into text/data, stack, etc for the user process so quite a bit. ;) Or do you mean wrt kernel space? If someone runs some binary installing such a proglet, I think it would also make sense to print a message (rate-limited) to the klog with current->comm, task_pid_nr(current) for the process installing this, from verifier side I mean. Maybe makes more sense than the print-once from the helper side. Thanks, Daniel ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-20 3:02 ` Alexei Starovoitov 2016-07-20 9:58 ` Sargun Dhillon @ 2016-07-20 10:05 ` Daniel Borkmann 1 sibling, 0 replies; 12+ messages in thread From: Daniel Borkmann @ 2016-07-20 10:05 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Sargun Dhillon, linux-kernel, netdev On 07/20/2016 05:02 AM, Alexei Starovoitov wrote: > On Wed, Jul 20, 2016 at 01:19:51AM +0200, Daniel Borkmann wrote: >> On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: >>> On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: >>>>> + return -EINVAL; >>>>> + >>>>> + /* Is this a user address, or a kernel address? */ >>>>> + if (!access_ok(VERIFY_WRITE, to, size)) >>>>> + return -EINVAL; >>>>> + >>>>> + return probe_kernel_write(to, from, size); >>>> >>>> I'm still worried that this can lead to all kind of hard to find >>>> bugs or races for user processes, if you make this writable to entire >>>> user address space (which is the only thing that access_ok() checks >>>> for). What if the BPF program has bugs and writes to wrong addresses >>>> for example, introducing bugs in some other, non-intended processes >>>> elsewhere? Can this be limited to syscalls only? And if so, to the >>>> passed memory only? >>> >>> my understanding that above code will write only to memory of current process, >>> so impact is contained and in that sense buggy kprobe program is no different >> >from buggy seccomp prorgram. >> >> Compared to seccomp, you might not notice that a race has happened, >> in seccomp case you might have killed your process, which is visible. >> But ok, in ptrace() case it might be similar issue perhaps ... >> >> The asm-generic version does __access_ok(..) { return 1; } for nommu >> case, I haven't checked closely enough whether there's actually an arch >> that uses this, but for example arm nommu with only one addr space would >> certainly result in access_ok() as 1, and then you could also go into >> probe_kernel_write(), no? > > good point. how arm nommu handles copy_to_user? if there is nommu Should probably boil down to something similar as plain memcpy(). > then there is no user/kernel mm ? Crazy archs. > I guess we have to disable this helper on all such archs. > >> Don't know that code well enough, but I believe the check would only >> ensure in normal use-cases that user process doesn't fiddle with kernel >> address space, but not necessarily guarantee that this really only >> belongs to the process address space. > > why? on x86 that exactly what it does. access_ok=true means > it's user space address and since we're in _this user context_ > probe_kernel_write can only affect this user. > >> x86 code comments this with "note that, depending on architecture, >> this function probably just checks that the pointer is in the user >> space range - after calling this function, memory access functions may >> still return -EFAULT". > > Yes. I've read that comment to :) > Certainly not an expert, but the archs I've looked at, access_ok > has the same meaning as on x86. They check the space range to > make sure address doesn't belong to kernel. > Could I have missed something? Certainly. Please double check :) > >> Also, what happens in case of kernel thread? > > my understanding if access_ok(addr)=true the addr will never point > to memory of kernel thread. If you're coming from user context only, this should be true, it'll check whether it's some user space pointer. >> As it stands, it does ... >> >> if (unlikely(in_interrupt())) >> return -EINVAL; >> if (unlikely(!task || !task->pid)) >> return -EINVAL; >> >> So up to here, irq/sirq, NULL current and that current is not the 'idle' >> process is being checked (still fail to see the point for the !task->pid, >> I believe the intend here is different). >> >> /* Is this a user address, or a kernel address? */ >> if (!access_ok(VERIFY_WRITE, to, size)) >> return -EINVAL; >> >> Now here. What if it's a kernel thread? You'll have KERNEL_DS segment, >> task->pid was non-zero as well for the kthread, so access_ok() will >> pass and you can still execute probe_kernel_write() ... > > I think user_addr_max() should be zero for kthread, but > worth checking for sure. It's 0xffffffffffffffff, I did a quick test yesterday night with creating a kthread, so access_ok() should pass for such case. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-19 23:19 ` Daniel Borkmann 2016-07-20 3:02 ` Alexei Starovoitov @ 2016-07-20 3:08 ` Sargun Dhillon 1 sibling, 0 replies; 12+ messages in thread From: Sargun Dhillon @ 2016-07-20 3:08 UTC (permalink / raw) To: Daniel Borkmann; +Cc: Alexei Starovoitov, linux-kernel, netdev On Wed, 20 Jul 2016, Daniel Borkmann wrote: > On 07/19/2016 06:34 PM, Alexei Starovoitov wrote: >> On Tue, Jul 19, 2016 at 01:17:53PM +0200, Daniel Borkmann wrote: >>>> + return -EINVAL; >>>> + >>>> + /* Is this a user address, or a kernel address? */ >>>> + if (!access_ok(VERIFY_WRITE, to, size)) >>>> + return -EINVAL; >>>> + >>>> + return probe_kernel_write(to, from, size); >>> >>> I'm still worried that this can lead to all kind of hard to find >>> bugs or races for user processes, if you make this writable to entire >>> user address space (which is the only thing that access_ok() checks >>> for). What if the BPF program has bugs and writes to wrong addresses >>> for example, introducing bugs in some other, non-intended processes >>> elsewhere? Can this be limited to syscalls only? And if so, to the >>> passed memory only? >> >> my understanding that above code will write only to memory of current >> process, >> so impact is contained and in that sense buggy kprobe program is no >> different >> from buggy seccomp prorgram. > > Compared to seccomp, you might not notice that a race has happened, > in seccomp case you might have killed your process, which is visible. > But ok, in ptrace() case it might be similar issue perhaps ... > > The asm-generic version does __access_ok(..) { return 1; } for nommu > case, I haven't checked closely enough whether there's actually an arch > that uses this, but for example arm nommu with only one addr space would > certainly result in access_ok() as 1, and then you could also go into > probe_kernel_write(), no? > > Don't know that code well enough, but I believe the check would only > ensure in normal use-cases that user process doesn't fiddle with kernel > address space, but not necessarily guarantee that this really only > belongs to the process address space. > You're right, it doesn't gaurantee that the memory being written to is the user, but only the current process should be mapped into that chunk of memory that is being operated at in the point in time. I think it'd be fairly difficult to write to another process' memory to corrupt it. It also might make sense to me to call access_ok with a known kernel address, and either deny access to the bpf helper, or warn in dmesg saying that it is running in an environment without any protection. -- I'm not sure if there is a preferred kernel address to check? What do you think? > x86 code comments this with "note that, depending on architecture, > this function probably just checks that the pointer is in the user > space range - after calling this function, memory access functions may > still return -EFAULT". > > Also, what happens in case of kernel thread? It seems fairly reasonable to me to check task->flags & PF_KTHREAD in order to prevent this scenario since there is no memory to copy_to_user to. > > As it stands, it does ... > > if (unlikely(in_interrupt())) > return -EINVAL; > if (unlikely(!task || !task->pid)) > return -EINVAL; > > So up to here, irq/sirq, NULL current and that current is not the 'idle' > process is being checked (still fail to see the point for the !task->pid, > I believe the intend here is different). > > /* Is this a user address, or a kernel address? */ > if (!access_ok(VERIFY_WRITE, to, size)) > return -EINVAL; > > Now here. What if it's a kernel thread? You'll have KERNEL_DS segment, > task->pid was non-zero as well for the kthread, so access_ok() will > pass and you can still execute probe_kernel_write() ... > >> Limiting this to syscalls will make it too limited. >> I'm in favor of this change, because it allows us to experiment >> with restartable sequences and lock-free algorithms that need ultrafast >> access to cpuid without burdening the kernel with stable abi. >> >>> Have you played around with ptrace() to check whether you could >>> achieve similar functionality (was thinking about things like [1], >>> PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't >>> this be limited to a similar functionality for only the current task. >>> ptrace() utilizes helpers like access_process_vm(), maybe this can >>> similarly be adapted here, too (under the circumstances that sleeping >>> is not allowed)? >> >> If we hack access_process_vm I think at the end it will look like >> probe_kernel_write. Walking mm requires semaphore, so we would only >> be able to do it in task_work and there we can do normal copy_to_user >> just as well, but it will complicate the programs quite a bit, since >> writes will be asynchronous and batched. >> Looks like with access_ok+probe_write we can achieve the same >> with a lot less code. > > I believe it may not quite be the same as it currently stands. No > fundamental objection, just that this needs to be made "safe" to the > limits you state above yourself. ;) > At least for my current use cases, doing this asynchronously could prove to be unworkable. >> As far as races between user and bpf program, yeah, if process >> is multithreaded, the other threads may access the same mem that >> bpf is writing to, but that's no different from reading. >> For tracing we walk complex datastructures and pointers. They >> can be changed by user space on the fly and bpf will see garbage. >> Like we have uprobe+bpf that walks clang c++ internal datastructures >> to figure out how long it takes clang to process .h headers. >> There is a lot of fragility in the bpf script, yet it's pretty >> useful to quickly analyze compile times. >> I see bpf_copy_to_user to be hugely valuable too, not as a stable >> interface, but rather as a tool to quickly debug and experiment. > > Right, so maybe there should be a warn once to the dmesg log that this > is just experimental? > > Thanks, > Daniel > It seems reasonable to me to give people a heads up, that they have bpf programs that can modify user memory. I can add a warn once with along the lines of: pr_warn_once("bpf_copy_to_user in use: Experimental feature, may corrupt memory"); --- I think that should scare people away to the downsides. That warning along with verifying that the access_ok function is doing useful things on their platform, and warning if it isn't seems to be a strong enough signal to people that might accidentally abuse it. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) 2016-07-19 11:17 ` Daniel Borkmann 2016-07-19 16:34 ` Alexei Starovoitov @ 2016-07-19 17:13 ` Sargun Dhillon 1 sibling, 0 replies; 12+ messages in thread From: Sargun Dhillon @ 2016-07-19 17:13 UTC (permalink / raw) To: Daniel Borkmann; +Cc: linux-kernel, netdev, alexei.starovoitov On Tue, 19 Jul 2016, Daniel Borkmann wrote: > Hi Sargun, > > On 07/19/2016 11:32 AM, Sargun Dhillon wrote: >> This allows user memory to be written to during the course of a kprobe. >> It shouldn't be used to implement any kind of security mechanism >> because of TOC-TOU attacks, but rather to debug, divert, and >> manipulate execution of semi-cooperative processes. >> >> Although it uses probe_kernel_write, we limit the address space >> the probe can write into by checking the space with access_ok. >> This call shouldn't sleep on any architectures based on review. >> >> It was tested with the tracex7 program on x86-64. >> >> Signed-off-by: Sargun Dhillon <sargun@sargun.me> >> Cc: Alexei Starovoitov <ast@kernel.org> > [...] >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index ebfbb7d..45878f3 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -81,6 +81,35 @@ static const struct bpf_func_proto bpf_probe_read_proto >> = { >> .arg3_type = ARG_ANYTHING, >> }; >> >> +static u64 bpf_copy_to_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) >> +{ >> + void *to = (void *) (long) r1; >> + void *from = (void *) (long) r2; >> + int size = (int) r3; > > Nit: you have two spaces here? > >> + struct task_struct *task = current; >> + >> + /* check if we're in a user context */ >> + if (unlikely(in_interrupt())) >> + return -EINVAL; >> + if (unlikely(!task || !task->pid)) > > Why !task->pid is needed? Can you point to the code where this is > set up as such? I'd assume if that's the case, there's a generic > helper for it to determine that the task_struct is a kernel task? > One example of precedent: http://lxr.free-electrons.com/source/arch/x86/kernel/process.c#L270 Also, while testing this out, I instrumented ip_route_output_flow, and I found that there were cases that were invoked from the kernel there current was not NULL, but task->pid was reliably 0. >> + return -EINVAL; >> + >> + /* Is this a user address, or a kernel address? */ >> + if (!access_ok(VERIFY_WRITE, to, size)) >> + return -EINVAL; >> + >> + return probe_kernel_write(to, from, size); > > I'm still worried that this can lead to all kind of hard to find > bugs or races for user processes, if you make this writable to entire > user address space (which is the only thing that access_ok() checks > for). What if the BPF program has bugs and writes to wrong addresses > for example, introducing bugs in some other, non-intended processes > elsewhere? Can this be limited to syscalls only? And if so, to the > passed memory only? > > Have you played around with ptrace() to check whether you could > achieve similar functionality (was thinking about things like [1], > PTRACE_PEEK{TEXT,DATA} / PTRACE_POKE{TEXT,DATA}). If not, why can't > this be limited to a similar functionality for only the current task. > ptrace() utilizes helpers like access_process_vm(), maybe this can > similarly be adapted here, too (under the circumstances that sleeping > is not allowed)? > > [1] https://code.google.com/archive/p/ptrace-parasite/ > I agree that this can get really complicated if it clashes with memory that a task is currently manipulating while we do the copy_to_user. I highlighted that in one of the commit messages not to rely on this behaviour for anything "real" because TOC-TOU. We could limit the use to be within syscalls and uprobes if you think that's the best idea. Is there an easy way to find out, but I think it's somewhat difficult to restrict it to passed memory only. In the example app there's only one level of indirection that occurs, and without some new rules in the verifier, I don't how to handle this. A simple, problematic example is dealing with iovecs, and process_vm_readv, and process_vm_writev, where to find the actual buffer content I need to dereference once to get the list of iovecs, and then again, to find the buffer the iovec points to. We looked at using ptrace, but our software is often working with blackbox code that sometimes does not like to ptrace'd. ptrace is kinda also a pain to work with, and has the same potential for data races you highlighted earlier. In fact, we have a demo of tracex7 that uses ptrace + process_vm_writev + seccomp, and it's much more complicated and fragile. Just casually poking around, it looks like access_process_vm sleeps, resulting in a whole new set of complexities that Alexei highlighted earlier. >> +} >> + >> +static const struct bpf_func_proto bpf_copy_to_user_proto = { >> + .func = bpf_copy_to_user, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_ANYTHING, >> + .arg2_type = ARG_PTR_TO_RAW_STACK, >> + .arg3_type = ARG_CONST_STACK_SIZE, > > Nit: please tab-align '=' like we have in the other *_proto cases in > bpf_trace. > >> +}; >> + >> /* >> * limited trace_printk() >> * only %d %u %x %ld %lu %lx %lld %llu %llx %p %s conversion specifiers >> allowed > > Thanks, > Daniel > Thanks for the review. Unless there's any other input, I'll go ahead and resubmit the patchset with the formatting nits fixed. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2016-07-21 11:30 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-07-19 9:32 [PATCH net-next v3 1/2] bpf: Add bpf_copy_to_user BPF helper to be called in tracers (kprobes) Sargun Dhillon 2016-07-19 11:17 ` Daniel Borkmann 2016-07-19 16:34 ` Alexei Starovoitov 2016-07-19 23:19 ` Daniel Borkmann 2016-07-20 3:02 ` Alexei Starovoitov 2016-07-20 9:58 ` Sargun Dhillon 2016-07-20 23:00 ` Daniel Borkmann 2016-07-21 10:47 ` Sargun Dhillon 2016-07-21 11:30 ` Daniel Borkmann 2016-07-20 10:05 ` Daniel Borkmann 2016-07-20 3:08 ` Sargun Dhillon 2016-07-19 17:13 ` Sargun Dhillon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).