* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree @ 2014-08-22 19:22 Oleg Nesterov 2014-08-22 20:15 ` Cyrill Gorcunov 0 siblings, 1 reply; 11+ messages in thread From: Oleg Nesterov @ 2014-08-22 19:22 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel Hi Cyrill, I think the patch is fine but I can't understand the usage of mmap_sem and alloc_lock, > + stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack); OK, find_vma() needs mmap_sem. But otherwise, why this should be called under down_read(&mm->mmap_sem) ? What this lock tries to protect? > + if (prctl_map.auxv_size) { > + up_read(&mm->mmap_sem); > + memset(user_auxv, 0, sizeof(user_auxv)); > + error = copy_from_user(user_auxv, > + (const void __user *)prctl_map.auxv, > + prctl_map.auxv_size); > + down_read(&mm->mmap_sem); And if we actually need this lock, why it is safe to drop it temporary? And why we can't move this copy_from_user() up before down_read) in any case? > + if (prctl_map.auxv_size) { > + /* Last entry must be AT_NULL as specification requires */ > + user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL; > + user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL; > + > + task_lock(current); > + memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); > + task_unlock(current); Again, could you explain this task_lock() ? Oleg. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-22 19:22 + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree Oleg Nesterov @ 2014-08-22 20:15 ` Cyrill Gorcunov 2014-08-23 11:53 ` Oleg Nesterov 0 siblings, 1 reply; 11+ messages in thread From: Cyrill Gorcunov @ 2014-08-22 20:15 UTC (permalink / raw) To: Oleg Nesterov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On Fri, Aug 22, 2014 at 09:22:41PM +0200, Oleg Nesterov wrote: > Hi Cyrill, > > I think the patch is fine but I can't understand the usage of mmap_sem > and alloc_lock, > > > + stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack); > > OK, find_vma() needs mmap_sem. But otherwise, why this should be called > under down_read(&mm->mmap_sem) ? What this lock tries to protect? It should protect from allocation/devetion/mergin of another vma. IOW when I lookup for vma I need to be sure it exist and won't disappear at least while I validate it. > > + if (prctl_map.auxv_size) { > > + up_read(&mm->mmap_sem); > > + memset(user_auxv, 0, sizeof(user_auxv)); > > + error = copy_from_user(user_auxv, > > + (const void __user *)prctl_map.auxv, > > + prctl_map.auxv_size); > > + down_read(&mm->mmap_sem); > > And if we actually need this lock, why it is safe to drop it temporary? > And why we can't move this copy_from_user() up before down_read) in any > case? Good point, thanks! I agreed that better to move this copying of user_auxv before taking read-lock (once Andrew answer me my question about copying of offsets in prctl_mm_map structure, I'll update on top). > > + if (prctl_map.auxv_size) { > > + /* Last entry must be AT_NULL as specification requires */ > > + user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL; > > + user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL; > > + > > + task_lock(current); > > + memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); > > + task_unlock(current); > > Again, could you explain this task_lock() ? It is used for serialization access to saved_auxv, ie when we fill it with new data the other reader (via procfs interface) should wait until we finish. Thanks for comments, Oleg! ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-22 20:15 ` Cyrill Gorcunov @ 2014-08-23 11:53 ` Oleg Nesterov 2014-08-23 12:22 ` Cyrill Gorcunov 0 siblings, 1 reply; 11+ messages in thread From: Oleg Nesterov @ 2014-08-23 11:53 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On 08/23, Cyrill Gorcunov wrote: > > On Fri, Aug 22, 2014 at 09:22:41PM +0200, Oleg Nesterov wrote: > > Hi Cyrill, > > > > I think the patch is fine but I can't understand the usage of mmap_sem > > and alloc_lock, > > > > > + stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack); > > > > OK, find_vma() needs mmap_sem. But otherwise, why this should be called > > under down_read(&mm->mmap_sem) ? What this lock tries to protect? > > It should protect from allocation/devetion/mergin of another vma. IOW when > I lookup for vma I need to be sure it exist and won't disappear at least > while I validate it. plus you need mmap_sem (at least for reading) when you update mm_struct, this is clear. My question was why the whole function should be called under mmap_sem? It could take it only around find_vma() + check(RLIMIT_STACK) ? In fact I do not think we need this vma_stack/RLIMIT_STACK check at all. It buys nithing and looks strange. RLIMIT_STACK is mostly for self-debugging, to catch the, say, unlimited recursion. An application can trivially create a stack region of arbitrary size. I'd seriously suggest to remove it. > > > + if (prctl_map.auxv_size) { > > > + /* Last entry must be AT_NULL as specification requires */ > > > + user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL; > > > + user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL; > > > + > > > + task_lock(current); > > > + memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); > > > + task_unlock(current); > > > > Again, could you explain this task_lock() ? > > It is used for serialization access to saved_auxv, ie when we fill it > with new data the other reader (via procfs interface) should wait until > we finish. But proc_pid_auxv() doesn't take this lock? And even if it did, this lock can't help. task_lock() is per-thread, and multiple threads (including CLONE_VM tasks, vfork() for example) can share the same ->mm. This certainly doesn't look right. Oleg. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 11:53 ` Oleg Nesterov @ 2014-08-23 12:22 ` Cyrill Gorcunov 2014-08-23 13:14 ` Oleg Nesterov 0 siblings, 1 reply; 11+ messages in thread From: Cyrill Gorcunov @ 2014-08-23 12:22 UTC (permalink / raw) To: Oleg Nesterov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On Sat, Aug 23, 2014 at 01:53:02PM +0200, Oleg Nesterov wrote: > > > > It should protect from allocation/devetion/mergin of another vma. IOW when > > I lookup for vma I need to be sure it exist and won't disappear at least > > while I validate it. > > plus you need mmap_sem (at least for reading) when you update mm_struct, > this is clear. > > My question was why the whole function should be called under mmap_sem? > It could take it only around find_vma() + check(RLIMIT_STACK) ? Stricktly speaking yes, but don't forget we might need to update exe::file as well which requires lock to be taken. So it is simplier to take the read-lock for the whole function. > In fact I do not think we need this vma_stack/RLIMIT_STACK check at all. > It buys nithing and looks strange. RLIMIT_STACK is mostly for self-debugging, > to catch the, say, unlimited recursion. An application can trivially > create a stack region of arbitrary size. I'd seriously suggest to remove it. Look, allocate stack for self is not a problem (we do this for our parasite code which executes inside dumpee address space) but RLIMIT_STACK check is present in ipc shmem so I think we still need this check in a sake of consistency. (note this code doesn't require any special caps so I need to use as much checks/tests as possible). > > > > > + if (prctl_map.auxv_size) { > > > > + /* Last entry must be AT_NULL as specification requires */ > > > > + user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL; > > > > + user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL; > > > > + > > > > + task_lock(current); > > > > + memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); > > > > + task_unlock(current); > > > > > > Again, could you explain this task_lock() ? > > > > It is used for serialization access to saved_auxv, ie when we fill it > > with new data the other reader (via procfs interface) should wait until > > we finish. > > But proc_pid_auxv() doesn't take this lock? And even if it did, this lock > can't help. task_lock() is per-thread, and multiple threads (including > CLONE_VM tasks, vfork() for example) can share the same ->mm. > > This certainly doesn't look right. It takes this lock but indeed this won't help much. Looks like I need to use cred_guard_mutex instead of task_lock here, no? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 12:22 ` Cyrill Gorcunov @ 2014-08-23 13:14 ` Oleg Nesterov 2014-08-23 13:30 ` Oleg Nesterov 0 siblings, 1 reply; 11+ messages in thread From: Oleg Nesterov @ 2014-08-23 13:14 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On 08/23, Cyrill Gorcunov wrote: > > On Sat, Aug 23, 2014 at 01:53:02PM +0200, Oleg Nesterov wrote: > > > > > > It should protect from allocation/devetion/mergin of another vma. IOW when > > > I lookup for vma I need to be sure it exist and won't disappear at least > > > while I validate it. > > > > plus you need mmap_sem (at least for reading) when you update mm_struct, > > this is clear. > > > > My question was why the whole function should be called under mmap_sem? > > It could take it only around find_vma() + check(RLIMIT_STACK) ? > > Stricktly speaking yes, but don't forget we might need to update > exe::file as well which requires lock to be taken. For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably this function was added by another patch. But, if this function calls set_mm_exe_file() (I guess it does?) then down_read() is not enough? set_mm_exe_file() can race with itself. And this still doesn't answer my question. As I said, I understand that we need mmap_sem to update mm_struct, and this is what prctl_set_mm_map() does at the end. And it also calls prctl_set_mm_exe_file_locked(), validate_prctl_map_locked() doesn't do this. > So it is simplier > to take the read-lock for the whole function. Still can't understand why validate_prctl_map_locked() should be called under this lock. OK, I won't insist. > > In fact I do not think we need this vma_stack/RLIMIT_STACK check at all. > > It buys nithing and looks strange. RLIMIT_STACK is mostly for self-debugging, > > to catch the, say, unlimited recursion. An application can trivially > > create a stack region of arbitrary size. I'd seriously suggest to remove it. > > Look, allocate stack for self is not a problem (we do this for our parasite > code which executes inside dumpee address space) but RLIMIT_STACK check is > present in ipc shmem so I think we still need this check in a sake of > consistency. But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack is simply unused, we only report it via /proc/. The same for, say, mm->start_code. It seems that only start_brk/end_data/brk need some validation. Perhaps something else, I didn't try to verify. So why do we need these confusing checks? > > > > > + if (prctl_map.auxv_size) { > > > > > + /* Last entry must be AT_NULL as specification requires */ > > > > > + user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL; > > > > > + user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL; > > > > > + > > > > > + task_lock(current); > > > > > + memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); > > > > > + task_unlock(current); > > > > > > > > Again, could you explain this task_lock() ? > > > > > > It is used for serialization access to saved_auxv, ie when we fill it > > > with new data the other reader (via procfs interface) should wait until > > > we finish. > > > > But proc_pid_auxv() doesn't take this lock? And even if it did, this lock > > can't help. task_lock() is per-thread, and multiple threads (including > > CLONE_VM tasks, vfork() for example) can share the same ->mm. > > > > This certainly doesn't look right. > > It takes this lock Where? Another patch I missed ? ;) > but indeed this won't help much. Yes, it can't help at all. > Looks like I need > to use cred_guard_mutex instead of task_lock here, no? Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold this lock. It does mm_access() which drops this lock after return. And to remind, we are going to remove mm_access/lock_trace from sys_read() paths in proc. Oleg. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 13:14 ` Oleg Nesterov @ 2014-08-23 13:30 ` Oleg Nesterov 2014-08-23 14:18 ` Cyrill Gorcunov 0 siblings, 1 reply; 11+ messages in thread From: Oleg Nesterov @ 2014-08-23 13:30 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel forgot to mention, On 08/23, Oleg Nesterov wrote: > > On 08/23, Cyrill Gorcunov wrote: > > > Looks like I need > > to use cred_guard_mutex instead of task_lock here, no? > > Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold > this lock. It does mm_access() which drops this lock after return. And to remind, > we are going to remove mm_access/lock_trace from sys_read() paths in proc. Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread), suppose that a vfork()'ed child does prctl() while another thread reads the parent's /proc/pid/auxv. Cyrill, I am sorry, but I am starting to think that this patch should be dropped and replaced by another version. Or do you think it would be better to send the fixes on top? Oleg. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 13:30 ` Oleg Nesterov @ 2014-08-23 14:18 ` Cyrill Gorcunov 2014-08-23 15:32 ` Oleg Nesterov 0 siblings, 1 reply; 11+ messages in thread From: Cyrill Gorcunov @ 2014-08-23 14:18 UTC (permalink / raw) To: Oleg Nesterov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On Sat, Aug 23, 2014 at 03:30:01PM +0200, Oleg Nesterov wrote: > forgot to mention, > > On 08/23, Oleg Nesterov wrote: > > > > On 08/23, Cyrill Gorcunov wrote: > > > > > Looks like I need > > > to use cred_guard_mutex instead of task_lock here, no? > > > > Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold > > this lock. It does mm_access() which drops this lock after return. And to remind, > > we are going to remove mm_access/lock_trace from sys_read() paths in proc. > > Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread), > suppose that a vfork()'ed child does prctl() while another thread reads the > parent's /proc/pid/auxv. Then either I need to use some other lock (not sure which one) either leave it completely unlocked mentionin in the man page such lockless behaviour. Thoughts? > Cyrill, I am sorry, but I am starting to think that this patch should be > dropped and replaced by another version. Or do you think it would be better > to send the fixes on top? It's not a problem to drop this particular patch (together with all fixes on top) one and replace it with new version (this looks like a better idea than drowning lkml with series of small patches). I rather need to understand what exactly should be done in new version. So from your previous email | > Stricktly speaking yes, but don't forget we might need to update | > exe::file as well which requires lock to be taken. | | For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably | this function was added by another patch. But, if this function calls | set_mm_exe_file() (I guess it does?) then down_read() is not enough? | set_mm_exe_file() can race with itself. yes, for reading, look in set_mm_exe_file we lookup for vma which should be not present when we change the link, and yes, because of read-only lock this call can race but only one caller success there because we allow to change exe link only once. | But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack | is simply unused, we only report it via /proc/. The same for, say, mm->start_code. that't the good question if this check in do_shmat is buggy or not, why do you think it's a bug there? Oleg, letme summarize all the concerns maybe there would be a way to handle them gracefully 1) How code flows for now (with all fixes on top of current Andrew's queued patches) - obtain struct prctl_mm_map from userspace - copy saved_auxv from userspace - down-read mmap_sem - validate all the data passed from userspace - we need a reference to stack-vma for RLIMIT_STACK check (this is doable, as you said, but until we drop the RLIMIT_STACK from do_shmat I would prefer it to be here) - we need to be sure that start_brk, brk, arg_start, arg_end, env_start, env_end really point to existing VMAs, strictly speaking the probgram can unmap all own VMAs except executable one and continue running without problem but this is not that practical I think and at first iteration I prefer more severe tests here on VMAs - setup new mm::exe_file (we need to be sure the old exe_file is unmapped so mmap_sem read-lock is needed) - update mm::saved_auxv with new values - finally setup new members to struct mm_struct - up-read mmap_sem 2) The qustion is do we really need that read-lock would be taken for all this time? And my answer is yes because of how I implement the checks for start_brk and etc. Oleg, check please if I undersnad you correctly, you propose - drop off mmap_sem completely - don't verify for RLIMIT_STACK - drop off task_lock when updating mm::saved_auxv but still invent how to prevent update/read race right? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 14:18 ` Cyrill Gorcunov @ 2014-08-23 15:32 ` Oleg Nesterov 2014-08-23 16:33 ` Cyrill Gorcunov 0 siblings, 1 reply; 11+ messages in thread From: Oleg Nesterov @ 2014-08-23 15:32 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On 08/23, Cyrill Gorcunov wrote: > > On Sat, Aug 23, 2014 at 03:30:01PM +0200, Oleg Nesterov wrote: > > > > On 08/23, Oleg Nesterov wrote: > > > > > > On 08/23, Cyrill Gorcunov wrote: > > > > > > > Looks like I need > > > > to use cred_guard_mutex instead of task_lock here, no? > > > > > > Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold > > > this lock. It does mm_access() which drops this lock after return. And to remind, > > > we are going to remove mm_access/lock_trace from sys_read() paths in proc. > > > > Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread), > > suppose that a vfork()'ed child does prctl() while another thread reads the > > parent's /proc/pid/auxv. > > Then either I need to use some other lock (not sure which one) either leave it > completely unlocked mentionin in the man page such lockless behaviour. Thoughts? Personally I think "lockless" is the best choice (not sure man page should know about this detail). I mean, I think that we do not care if proc_pid_auxv() prints garbage if it races with ptctl(). Otherwise we have to use mmap_sem in proc_pid_auxv(), doesn't look nice. > | > Stricktly speaking yes, but don't forget we might need to update > | > exe::file as well which requires lock to be taken. > | > | For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably > | this function was added by another patch. But, if this function calls > | set_mm_exe_file() (I guess it does?) then down_read() is not enough? > | set_mm_exe_file() can race with itself. > > yes, for reading, look in set_mm_exe_file we lookup for vma which should > be not present when we change the link, and yes, because of read-only lock > this call can race but only one caller success there because we allow > to change exe link only once. Ah, I forgot about MMF_EXE_FILE_CHANGED, thanks for correcting me. (btw I think this check must die too, but this is off-topic and I was wrong anyway). OK, but I still think down_read(mmap_sem) is not enough. get_mm_exe_file() can do get_file() after prctl() paths do the final fput(). Or please look at tomoyo_get_exe(). Another thread can play with mm->exe_file fput(). Plus I am a bit worrried about inode_permission() under mmap_sem... but this is probably fine. Although you can never know which locks a creative filesystem/security module can take ;) But probably this is fine. > | But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack > | is simply unused, we only report it via /proc/. The same for, say, mm->start_code. > > that't the good question if this check in do_shmat is buggy or not, why do > you think it's a bug there? Please see the patch I sent. > Oleg, letme summarize all the concerns maybe there would be a way to handle > them gracefully > > 1) How code flows for now (with all fixes on top of current Andrew's queued patches) > > - obtain struct prctl_mm_map from userspace > - copy saved_auxv from userspace > - down-read mmap_sem > - validate all the data passed from userspace I won't argue, but at least mmap_min/max_addr do not need mmap_sem. > - we need a reference to stack-vma for RLIMIT_STACK check (this is doable, > as you said, but until we drop the RLIMIT_STACK from do_shmat I would > prefer it to be here) OK, I won't argue, but I think this is pointless and misleading. And btw, where do you see RLIMIT_STACK in do_shmat() ? > - we need to be sure that start_brk, brk, probably yes, simply because the kernel actually uses this members. > arg_start, arg_end, env_start, env_end > really point to existing VMAs, strictly speaking the probgram can unmap > all own VMAs except executable one and continue running without problem > but this is not that practical I think and at first iteration I prefer > more severe tests here on VMAs But, again, for what? There are only used to report this info via /proc/. > - setup new mm::exe_file (we need to be sure the old exe_file is unmapped > so mmap_sem read-lock is needed) See above. > Oleg, check please if I undersnad you correctly, you propose > > - drop off mmap_sem completely No, no, I didn't, we obviously can't do this. > - don't verify for RLIMIT_STACK Yes, and more "don't verify". But again, I won't really argue. Just in my opinion almost all these checks looks misleading, confusing, and unnecessary. Please think about those who will try to understand this code. A little comment like "this is not needed but we all are paranoid in openvz" could make it a bit more understandable ;) > - drop off task_lock when updating mm::saved_auxv but still invent > how to prevent update/read race Personally I think we can simply ignore this race. But let me repeat, I won't argue with any approach as long as I think it is fine correctness wise. Oleg. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 15:32 ` Oleg Nesterov @ 2014-08-23 16:33 ` Cyrill Gorcunov 2014-08-23 19:29 ` Oleg Nesterov 0 siblings, 1 reply; 11+ messages in thread From: Cyrill Gorcunov @ 2014-08-23 16:33 UTC (permalink / raw) To: Oleg Nesterov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On Sat, Aug 23, 2014 at 05:32:22PM +0200, Oleg Nesterov wrote: > > > > > > Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread), > > > suppose that a vfork()'ed child does prctl() while another thread reads the > > > parent's /proc/pid/auxv. > > > > Then either I need to use some other lock (not sure which one) either leave it > > completely unlocked mentionin in the man page such lockless behaviour. Thoughts? > > Personally I think "lockless" is the best choice (not sure man page should > know about this detail). I mean, I think that we do not care if proc_pid_auxv() > prints garbage if it races with ptctl(). > > Otherwise we have to use mmap_sem in proc_pid_auxv(), doesn't look nice. I see, thanks a lot! > > > | > Stricktly speaking yes, but don't forget we might need to update > > | > exe::file as well which requires lock to be taken. > > | > > | For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably > > | this function was added by another patch. But, if this function calls > > | set_mm_exe_file() (I guess it does?) then down_read() is not enough? > > | set_mm_exe_file() can race with itself. > > > > yes, for reading, look in set_mm_exe_file we lookup for vma which should > > be not present when we change the link, and yes, because of read-only lock > > this call can race but only one caller success there because we allow > > to change exe link only once. > > Ah, I forgot about MMF_EXE_FILE_CHANGED, thanks for correcting me. > > (btw I think this check must die too, but this is off-topic and I was > wrong anyway). > > OK, but I still think down_read(mmap_sem) is not enough. get_mm_exe_file() > can do get_file() after prctl() paths do the final fput(). > > Or please look at tomoyo_get_exe(). Another thread can play with mm->exe_file > fput(). > > Plus I am a bit worrried about inode_permission() under mmap_sem... but > this is probably fine. Although you can never know which locks a creative > filesystem/security module can take ;) But probably this is fine. Thanks, I'll re-check. > > | But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack > > | is simply unused, we only report it via /proc/. The same for, say, mm->start_code. > > > > that't the good question if this check in do_shmat is buggy or not, why do > > you think it's a bug there? > > Please see the patch I sent. Already ;) > > Oleg, letme summarize all the concerns maybe there would be a way to handle > > them gracefully > > > > 1) How code flows for now (with all fixes on top of current Andrew's queued patches) > > > > - obtain struct prctl_mm_map from userspace > > - copy saved_auxv from userspace > > - down-read mmap_sem > > - validate all the data passed from userspace > > I won't argue, but at least mmap_min/max_addr do not need mmap_sem. > > > - we need a reference to stack-vma for RLIMIT_STACK check (this is doable, > > as you said, but until we drop the RLIMIT_STACK from do_shmat I would > > prefer it to be here) > > OK, I won't argue, but I think this is pointless and misleading. > > And btw, where do you see RLIMIT_STACK in do_shmat() ? Indirectly, though start_stack pointer. We assign it in setup_arg_pages taking into account RLIMIT_STACK value, then do_shmat operates with start_stack pointer, thus when we setup some new value into start_stack we at least should not exceed old RLIMIT_STACK? This is not anyhow critial I think but I wanted to do as maximal tests as possible. > > > - we need to be sure that start_brk, brk, > > probably yes, simply because the kernel actually uses this members. > > > arg_start, arg_end, env_start, env_end > > really point to existing VMAs, strictly speaking the probgram can unmap > > all own VMAs except executable one and continue running without problem > > but this is not that practical I think and at first iteration I prefer > > more severe tests here on VMAs > > But, again, for what? There are only used to report this info via /proc/. Because it's close to how loading elf proceeds. This prctl is like emulating micro-elf loader ;) Indeed there is no problem for kernel if all these members are pointing to some already umapped VMAs, but earlier we required cap-sysadmin to be grated so that if someone calls for this prctl he must be knowing what he is doing, and if some other unpredicted change in kernel code cause this prctl to panic the kernel such action could be initiated by sysadmin only, not regular user. Now I've released the security requirement so I thought let be more paranoid and require all VMAs to present, all rlimits to be in good shape and such even if the members we're modifying are used for procfs statistics output mostly, we always have a way to relax this tests but not the reverse. That was the main idea ;) > > - setup new mm::exe_file (we need to be sure the old exe_file is unmapped > > so mmap_sem read-lock is needed) > > See above. > > > Oleg, check please if I undersnad you correctly, you propose > > > > - drop off mmap_sem completely > > No, no, I didn't, we obviously can't do this. > > > - don't verify for RLIMIT_STACK > > Yes, and more "don't verify". But again, I won't really argue. Just in my > opinion almost all these checks looks misleading, confusing, and unnecessary. > > Please think about those who will try to understand this code. A little > comment like "this is not needed but we all are paranoid in openvz" could > make it a bit more understandable ;) Yeah, I think I should add this comment into the code. > > > - drop off task_lock when updating mm::saved_auxv but still invent > > how to prevent update/read race > > Personally I think we can simply ignore this race. > > But let me repeat, I won't argue with any approach as long as I think > it is fine correctness wise. I see, thanks a lot for comements Oleg!! ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 16:33 ` Cyrill Gorcunov @ 2014-08-23 19:29 ` Oleg Nesterov 2014-08-23 20:11 ` Cyrill Gorcunov 0 siblings, 1 reply; 11+ messages in thread From: Oleg Nesterov @ 2014-08-23 19:29 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On 08/23, Cyrill Gorcunov wrote: > > On Sat, Aug 23, 2014 at 05:32:22PM +0200, Oleg Nesterov wrote: > > > > And btw, where do you see RLIMIT_STACK in do_shmat() ? > > Indirectly, though start_stack pointer. We assign it in setup_arg_pages taking into > account RLIMIT_STACK value, then do_shmat operates with start_stack pointer, > thus when we setup some new value into start_stack we at least should > not exceed old RLIMIT_STACK? Still can't understand how do_shmat() connects to RLIMIT_STACK, even indirectly. setup_arg_pages() obviously uses RLIMIT_STACK, but with or without the patch I sent do_shmat() behaviour does not depend on RLIMIT_STACK at all. > > > arg_start, arg_end, env_start, env_end > > > really point to existing VMAs, strictly speaking the probgram can unmap > > > all own VMAs except executable one and continue running without problem > > > but this is not that practical I think and at first iteration I prefer > > > more severe tests here on VMAs > > > > But, again, for what? There are only used to report this info via /proc/. > > Because it's close to how loading elf proceeds. This prctl is like emulating > micro-elf loader ;) Indeed there is no problem for kernel if all these members > are pointing to some already umapped VMAs, Yes. And whatever checks you add into prctl(), an application can simply do mmap() before prctl() just to fool it, and unmap (say) arg_start right after that. > but earlier we required cap-sysadmin > to be grated so that if someone calls for this prctl he must be knowing what > he is doing, and if some other unpredicted change in kernel code cause this > prctl to panic the kernel such action could be initiated by sysadmin only, > not regular user. Now I've released the security requirement so I thought > let be more paranoid and require all VMAs to present, all rlimits to be > in good shape and such even if the members we're modifying are used for > procfs statistics output mostly, we always have a way to relax this tests > but not the reverse. That was the main idea ;) Following this logic you should also verify that KSTK_ESP(group_leader) falls into ->start_stack area or return an error otherwise. As for elf loader, of course it sets ->start_stack/etc correctly and they can't point to nowehere, but this is only because it actually creates these segments. And I guess the c/r tools should do this "correctly" too, but prctl() itself can never verify this. OK, lets look at this from another angle. Suppose that an application switches to another stack and unmaps ->start_stack area. Now, how are you going to restore it correctly if ->start_stack points to nowhere? Are you going to "fix" ->start_stack? I do not think this would be correct. Or, are you going to create the additional vma just to make prctl() happy? Oleg. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree 2014-08-23 19:29 ` Oleg Nesterov @ 2014-08-23 20:11 ` Cyrill Gorcunov 0 siblings, 0 replies; 11+ messages in thread From: Cyrill Gorcunov @ 2014-08-23 20:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman, H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton, linux-kernel On Sat, Aug 23, 2014 at 09:29:15PM +0200, Oleg Nesterov wrote: > On 08/23, Cyrill Gorcunov wrote: > > > > On Sat, Aug 23, 2014 at 05:32:22PM +0200, Oleg Nesterov wrote: > > > > > > And btw, where do you see RLIMIT_STACK in do_shmat() ? > > > > Indirectly, though start_stack pointer. We assign it in setup_arg_pages taking into > > account RLIMIT_STACK value, then do_shmat operates with start_stack pointer, > > thus when we setup some new value into start_stack we at least should > > not exceed old RLIMIT_STACK? > > Still can't understand how do_shmat() connects to RLIMIT_STACK, even > indirectly. setup_arg_pages() obviously uses RLIMIT_STACK, but with or > without the patch I sent do_shmat() behaviour does not depend on > RLIMIT_STACK at all. setup_arg_pages sets up start_stack member and obeys RLIMIT_STACK limit so if there were some rlimit set the start_stack will point to space allocated with known size, then do_shmat takes start_stack and checks if there space left. So initial value of start_stack is limited to rlimit setting. When setup_arg_pages is called the start_stack has rlimit constrain, so it looks to me pretty natural to do the same in prctl, no? > > > > arg_start, arg_end, env_start, env_end > > > > really point to existing VMAs, strictly speaking the probgram can unmap > > > > all own VMAs except executable one and continue running without problem > > > > but this is not that practical I think and at first iteration I prefer > > > > more severe tests here on VMAs > > > > > > But, again, for what? There are only used to report this info via /proc/. > > > > Because it's close to how loading elf proceeds. This prctl is like emulating > > micro-elf loader ;) Indeed there is no problem for kernel if all these members > > are pointing to some already umapped VMAs, > > Yes. And whatever checks you add into prctl(), an application can simply > do mmap() before prctl() just to fool it, and unmap (say) arg_start right > after that. yes it can, but prctl on itsown actually not much usefull until you need to mimic "another program" procfs output like we do in criu :) > > but earlier we required cap-sysadmin > > to be grated so that if someone calls for this prctl he must be knowing what > > he is doing, and if some other unpredicted change in kernel code cause this > > prctl to panic the kernel such action could be initiated by sysadmin only, > > not regular user. Now I've released the security requirement so I thought > > let be more paranoid and require all VMAs to present, all rlimits to be > > in good shape and such even if the members we're modifying are used for > > procfs statistics output mostly, we always have a way to relax this tests > > but not the reverse. That was the main idea ;) > > Following this logic you should also verify that KSTK_ESP(group_leader) > falls into ->start_stack area or return an error otherwise. And probably I should, I'll re-check. > > As for elf loader, of course it sets ->start_stack/etc correctly and they > can't point to nowehere, but this is only because it actually creates these > segments. And I guess the c/r tools should do this "correctly" too, but > prctl() itself can never verify this. > > OK, lets look at this from another angle. Suppose that an application switches > to another stack and unmaps ->start_stack area. Now, how are you going to > restore it correctly if ->start_stack points to nowhere? Are you going > to "fix" ->start_stack? I do not think this would be correct. Or, are you > going to create the additional vma just to make prctl() happy? Actually all the programs we were c/r'ing are "normal" ones who doesn't mangle own contents but I fear yes, with current approach if some vma's are missing prctl refuse to proceed. Oleg, I'm cooking the patch which addresses all your comments and relaxes the restrictions I simply need once again grep all members to be sure it's safe. Hopefully I'll finish tomorrow and show the patch, still if there some other concerns please notify me I'm happy to address any of them! ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2014-08-23 20:11 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-08-22 19:22 + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree Oleg Nesterov 2014-08-22 20:15 ` Cyrill Gorcunov 2014-08-23 11:53 ` Oleg Nesterov 2014-08-23 12:22 ` Cyrill Gorcunov 2014-08-23 13:14 ` Oleg Nesterov 2014-08-23 13:30 ` Oleg Nesterov 2014-08-23 14:18 ` Cyrill Gorcunov 2014-08-23 15:32 ` Oleg Nesterov 2014-08-23 16:33 ` Cyrill Gorcunov 2014-08-23 19:29 ` Oleg Nesterov 2014-08-23 20:11 ` Cyrill Gorcunov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox