* Re: [patch 19/24] TASK_SIZE is variable. [not found] <200502050150.j151osl11380@mail.osdl.org> @ 2005-02-05 2:16 ` Linus Torvalds 2005-02-05 3:29 ` Linus Torvalds 2005-02-05 6:54 ` Andi Kleen 0 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2005-02-05 2:16 UTC (permalink / raw) To: Andrew Morton; +Cc: dwmw2, Linux Arch list On Fri, 4 Feb 2005 akpm@osdl.org wrote: > > From: David Woodhouse <dwmw2@infradead.org> > > Bad things can happen if a 32-bit process is the last user of a 64-bit mm. > TASK_SIZE isn't a constant, and we can end up clearing page tables only up > to the 32-bit TASK_SIZE instead of all the way. We should probably > double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind > of problem. I think this is the wrong thing. I think this is a bug in the architectures that play games with TASK_SIZE. They shouldn't. TASK_SIZE and USER_PTRS_PER_PGD should _not_ be dependent on whether some process is in compatibility mode or not. The fact that a 32-bit process on a 64-bit kernel wants to limit certain things to the low bits does _not_ mean that TASK_SIZE should change. It should mean that "TASK_UNMAPPED_BASE" might change (preferably not even that, actually - just make the damn arch_get_unmapped_area() have explicit knowledge about min/max issues instead!), and that "tsk->addr_limit" might change. But changing TASK_SIZE is crazy. And leads to exactly the kinds of problems this patch tries to fix (and who knows how many others). Would arch maintainers please look at fixing this in their architectures instead? Make TASK_SIZE be the maximum size a process VM can have. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 2:16 ` [patch 19/24] TASK_SIZE is variable Linus Torvalds @ 2005-02-05 3:29 ` Linus Torvalds 2005-02-05 5:52 ` David S. Miller 2005-02-05 9:06 ` Russell King 2005-02-05 6:54 ` Andi Kleen 1 sibling, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2005-02-05 3:29 UTC (permalink / raw) To: Andrew Morton; +Cc: dwmw2, Linux Arch list On Fri, 4 Feb 2005, Linus Torvalds wrote: > > Would arch maintainers please look at fixing this in their architectures > instead? Make TASK_SIZE be the maximum size a process VM can have. Actually, thinking more about it, maybe the right idea (to avoid confusion) is to walk away from TASK_SIZE entirely. For example, replacing TASK_SIZE in fs/namei.c with "thread->addr_limit" would actually clean up the code: it would mean that the games with "get_fs()" etc would just go away, to be replaced with something like unsigned long len; unsigned long limit = current_thread_info()->addr_limit; if ((unsigned long) filename >= limit) return -EFAULT; len = limit - (unsigned long) filename; if (len > PATH_MAX) len = PATH_MAX; which looks cleaner. (This also assumes that we would make mm_segment_t just be "unsigned long", and stop playing games with segments looking like they were used a few hundred million years ago in the paleozoic era Linux kernel. It made sense back when we did that long-ago conversion, it doesn't seem to make much sense any more ;). Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 3:29 ` Linus Torvalds @ 2005-02-05 5:52 ` David S. Miller 2005-02-07 10:59 ` David Howells 2005-02-05 9:06 ` Russell King 1 sibling, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-05 5:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: akpm, dwmw2, linux-arch On Fri, 4 Feb 2005 19:29:17 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > For example, replacing TASK_SIZE in fs/namei.c with "thread->addr_limit" > would actually clean up the code: it would mean that the games with > "get_fs()" etc would just go away, to be replaced with something like I think that looks nice too. Would you be against a "mm->addr_limit"? That's what an approach by Paulus implemented, and I was in the camp supporting that kind of direction. "mm"'s exist potentially long after the "thread" part goes away. That's why my position is that logically any "limit" on the address space is an mm property. And thusly, this is what that mmap() et al. stuff should be checking not some arg'less macro called TASK_SIZE. As an aside, while I realize the bad side effects of making USER_PTRS_PER_PGD dynamic, it makes 32-bit process teardown a lot faster so it's no surprise all the platforms do it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 5:52 ` David S. Miller @ 2005-02-07 10:59 ` David Howells 2005-02-07 19:30 ` David S. Miller 0 siblings, 1 reply; 27+ messages in thread From: David Howells @ 2005-02-07 10:59 UTC (permalink / raw) To: David S. Miller; +Cc: Linus Torvalds, akpm, dwmw2, linux-arch David S. Miller <davem@davemloft.net> wrote: > > For example, replacing TASK_SIZE in fs/namei.c with "thread->addr_limit" > > would actually clean up the code: it would mean that the games with > > "get_fs()" etc would just go away, to be replaced with something like > > I think that looks nice too. > > Would you be against a "mm->addr_limit"? That's what an approach > by Paulus implemented, and I was in the camp supporting that kind > of direction. On the other hand, this value is constant on some archs, and in those cases, wouldn't it be better for it to be a compile-time constant? David ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-07 10:59 ` David Howells @ 2005-02-07 19:30 ` David S. Miller 2005-02-08 9:05 ` Martin Schwidefsky 0 siblings, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-07 19:30 UTC (permalink / raw) To: David Howells; +Cc: torvalds, akpm, dwmw2, linux-arch On Mon, 07 Feb 2005 10:59:13 +0000 David Howells <dhowells@redhat.com> wrote: > David S. Miller <davem@davemloft.net> wrote: > > > > For example, replacing TASK_SIZE in fs/namei.c with "thread->addr_limit" > > > would actually clean up the code: it would mean that the games with > > > "get_fs()" etc would just go away, to be replaced with something like > > > > I think that looks nice too. > > > > Would you be against a "mm->addr_limit"? That's what an approach > > by Paulus implemented, and I was in the camp supporting that kind > > of direction. > > On the other hand, this value is constant on some archs, and in those cases, > wouldn't it be better for it to be a compile-time constant? How can it be constant? It would need to change when a set_fs(KERNEL_DS) is performed. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-07 19:30 ` David S. Miller @ 2005-02-08 9:05 ` Martin Schwidefsky 2005-02-08 19:09 ` David S. Miller 0 siblings, 1 reply; 27+ messages in thread From: Martin Schwidefsky @ 2005-02-08 9:05 UTC (permalink / raw) To: David S. Miller; +Cc: akpm, David Howells, dwmw2, linux-arch, torvalds > > > Would you be against a "mm->addr_limit"? That's what an approach > > > by Paulus implemented, and I was in the camp supporting that kind > > > of direction. > > > > On the other hand, this value is constant on some archs, and in those cases, > > wouldn't it be better for it to be a compile-time constant? > > How can it be constant? It would need to change when a set_fs(KERNEL_DS) > is performed. It can be a constant on architectures that use a separate address space for the kernel, e.g. s390. In fact I don't set addr_limit at all, access_ok and friends always return 1. blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-08 9:05 ` Martin Schwidefsky @ 2005-02-08 19:09 ` David S. Miller 0 siblings, 0 replies; 27+ messages in thread From: David S. Miller @ 2005-02-08 19:09 UTC (permalink / raw) To: Martin Schwidefsky; +Cc: akpm, dhowells, dwmw2, linux-arch, torvalds On Tue, 8 Feb 2005 10:05:06 +0100 Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > It can be a constant on architectures that use a separate address space for > the kernel, e.g. s390. In fact I don't set addr_limit at all, access_ok and > friends always return 1. I wouldn't say it's constant in this case, but rather that it's not needed at all. :-) Sparc64 also returns "1" all the time, and will continue to do so no matter what the non-split kernel/user platforms end up doing, just like s390. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 3:29 ` Linus Torvalds 2005-02-05 5:52 ` David S. Miller @ 2005-02-05 9:06 ` Russell King 2005-02-05 23:44 ` David S. Miller 1 sibling, 1 reply; 27+ messages in thread From: Russell King @ 2005-02-05 9:06 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, dwmw2, Linux Arch list On Fri, Feb 04, 2005 at 07:29:17PM -0800, Linus Torvalds wrote: > For example, replacing TASK_SIZE in fs/namei.c with "thread->addr_limit" > would actually clean up the code: it would mean that the games with > "get_fs()" etc would just go away, to be replaced with something like > > unsigned long len; > unsigned long limit = current_thread_info()->addr_limit; > > if ((unsigned long) filename >= limit) > return -EFAULT; > len = limit - (unsigned long) filename; > if (len > PATH_MAX) > len = PATH_MAX; > > which looks cleaner. Except that "addr_limit" may be defined by an architecture to be zero (which can be interpreted as 4GB by the arch specific code) for the case where we allow kernel mode access. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/ 2.6 Serial core ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 9:06 ` Russell King @ 2005-02-05 23:44 ` David S. Miller 2005-02-06 10:50 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-05 23:44 UTC (permalink / raw) To: Russell King; +Cc: torvalds, akpm, dwmw2, linux-arch On Sat, 5 Feb 2005 09:06:19 +0000 Russell King <rmk@arm.linux.org.uk> wrote: > Except that "addr_limit" may be defined by an architecture to be zero > (which can be interpreted as 4GB by the arch specific code) for the > case where we allow kernel mode access. I believe this to be a problematic scheme, let me explain why. First, "set_fs(KERNEL_DS)" allows kernel mode access, but it absolutely must not allow user mode accesses. It seems to suggest we might need some "addr_min" value for access_ok() checking purposes... Also, as I tried to explain in another email today in this thread, cpu's fall roughly into two categories: 1) Single virtual address range, page table protection (or "implicit" protection bits) for address ranges determine supervisor vs. user access. x86_64, x86, MIPS, and Alpha I know fall into this category. 2) Really seperate supervisor and user address spaces. Which one to get at is specified by an added attribute tag given to load and store instructions. There is an implicit tag active at all times which says what a normal load/store accesses. So for example: load_word [%addr] ASI_USER, %reg done from supervisor space cannot possibly reference supervisor space, for any value of %addr. On sparc64, which uses the model as in #2, there is an "%asi" register which holds ASI_* values. So we just make set_fs() update this register with either ASI_USER or ASI_KERNEL. Then for userspace accesses, we use '[%addr] %asi' addressing in the load/store instructions. As a result, access_ok() is a complete NOP. The CPU does all the work at load/store time. On platforms using model #1, access_ok() can use some software state (min_addr/max_addr), which specifies the address where userspace ends and supervisor virtual addresses begin. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 23:44 ` David S. Miller @ 2005-02-06 10:50 ` Andi Kleen 2005-02-06 21:19 ` David S. Miller 0 siblings, 1 reply; 27+ messages in thread From: Andi Kleen @ 2005-02-06 10:50 UTC (permalink / raw) To: David S. Miller; +Cc: Russell King, torvalds, akpm, dwmw2, linux-arch On Sat, Feb 05, 2005 at 03:44:48PM -0800, David S. Miller wrote: > On Sat, 5 Feb 2005 09:06:19 +0000 > Russell King <rmk@arm.linux.org.uk> wrote: > > > Except that "addr_limit" may be defined by an architecture to be zero > > (which can be interpreted as 4GB by the arch specific code) for the > > case where we allow kernel mode access. > > I believe this to be a problematic scheme, let me explain why. > > First, "set_fs(KERNEL_DS)" allows kernel mode access, but it absolutely > must not allow user mode accesses. It seems to suggest we might need > some "addr_min" value for access_ok() checking purposes... That's an unreasonable requirement which no architecture other than those with truly separate address spaces follow. I think on the others it would lead to quite bad code bloat for the additional tests (access_ok is called very often) -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 10:50 ` Andi Kleen @ 2005-02-06 21:19 ` David S. Miller 2005-02-06 21:31 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-06 21:19 UTC (permalink / raw) To: Andi Kleen; +Cc: rmk, torvalds, akpm, dwmw2, linux-arch On Sun, 6 Feb 2005 11:50:48 +0100 Andi Kleen <ak@suse.de> wrote: > > First, "set_fs(KERNEL_DS)" allows kernel mode access, but it absolutely > > must not allow user mode accesses. It seems to suggest we might need > > some "addr_min" value for access_ok() checking purposes... > > That's an unreasonable requirement which no architecture other than > those with truly separate address spaces follow. If you allow userspace accesses to succeed during KERNEL_DS, so much bad stuff can happen. We've demonstrated that several times with the compat layer bugs. Why not make it trap on all platforms, instead of until someone hits it on sparc64 or similar? Do you like finding bugs immediately, or at some random time in the future? I like my bugs to jump up and down quickly saying "I'm a bug" instead of "try and find me sucker" :-) The x86 access_ok() already checks the address against a base stored in the current_thread_info(), all I'm proposing is to add a low value to the range as well and to adjust it at set_fs() time, which set_fs() is already effectively doing on x86. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 21:19 ` David S. Miller @ 2005-02-06 21:31 ` Andi Kleen 2005-02-06 21:31 ` David S. Miller 0 siblings, 1 reply; 27+ messages in thread From: Andi Kleen @ 2005-02-06 21:31 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, rmk, torvalds, akpm, dwmw2, linux-arch > If you allow userspace accesses to succeed during KERNEL_DS, so > much bad stuff can happen. We've demonstrated that several times > with the compat layer bugs. These are easy to review. > > Why not make it trap on all platforms, instead of until someone hits > it on sparc64 or similar? Because access_ok is used very often and adding anything more to it leads to excessive binary bloat. > Do you like finding bugs immediately, or at some random time > in the future? I like my bugs to jump up and down quickly > saying "I'm a bug" instead of "try and find me sucker" :-) If you want you could add a check with a CONFIG option in the kernel debug menu. But I'm not sure it's worth it, it would probably be somewhat ugly. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 21:31 ` Andi Kleen @ 2005-02-06 21:31 ` David S. Miller 2005-02-06 21:50 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-06 21:31 UTC (permalink / raw) To: Andi Kleen; +Cc: rmk, torvalds, akpm, dwmw2, linux-arch On Sun, 6 Feb 2005 22:31:49 +0100 Andi Kleen <ak@suse.de> wrote: > If you want you could add a check with a CONFIG option > in the kernel debug menu. But I'm not sure it's worth it, > it would probably be somewhat ugly. So in some of the compat layer cases we had at one point in the alsa sound layer, in order to find the bug on x86_64 the user would have to 1) use sound and 2) enable a special obscure config option. We're talking about 5 or 6 instructions on x86. If it's getting too big, put it out-of-line. It probably belongs out-of-line in it's current form anyways. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 21:31 ` David S. Miller @ 2005-02-06 21:50 ` Andi Kleen 2005-02-06 22:25 ` David S. Miller 2005-02-06 22:31 ` David S. Miller 0 siblings, 2 replies; 27+ messages in thread From: Andi Kleen @ 2005-02-06 21:50 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, rmk, torvalds, akpm, dwmw2, linux-arch On Sun, Feb 06, 2005 at 01:31:10PM -0800, David S. Miller wrote: > We're talking about 5 or 6 instructions on x86. > If it's getting too big, put it out-of-line. It probably > belongs out-of-line in it's current form anyways. I don't think that's a good idea, sorry - make a very common inline considerably bigger and/or slower just to make sparc64 debugging easier. And yes this stuff does matter - i remember i got LM benchmarkable improvements in signal latency by optimizing __copy_to_user to use optimized inlines for small stores. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 21:50 ` Andi Kleen @ 2005-02-06 22:25 ` David S. Miller 2005-02-06 22:31 ` David S. Miller 1 sibling, 0 replies; 27+ messages in thread From: David S. Miller @ 2005-02-06 22:25 UTC (permalink / raw) To: Andi Kleen; +Cc: rmk, torvalds, akpm, dwmw2, linux-arch On Sun, 6 Feb 2005 22:50:20 +0100 Andi Kleen <ak@suse.de> wrote: > And yes this stuff does matter - i remember i got LM benchmarkable > improvements in signal latency by optimizing __copy_to_user > to use optimized inlines for small stores. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 21:50 ` Andi Kleen 2005-02-06 22:25 ` David S. Miller @ 2005-02-06 22:31 ` David S. Miller 2005-02-07 8:11 ` Andi Kleen 1 sibling, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-06 22:31 UTC (permalink / raw) To: Andi Kleen; +Cc: rmk, torvalds, akpm, dwmw2, linux-arch On Sun, 6 Feb 2005 22:50:20 +0100 Andi Kleen <ak@suse.de> wrote: > And yes this stuff does matter - i remember i got LM benchmarkable > improvements in signal latency by optimizing __copy_to_user > to use optimized inlines for small stores. Moving access_ok() out-of-line might even improve I-cache access over what we have today, even with the new min-max check. The min-max variables will be in the same cache line in whatever struct we place them into, so whatever cache miss access_ok() gets now will also be the same for the min-max version. This is kind of strange to be arguing about, given that we just put 4-level page tables into the tree, right? That regressed everybody performance wise, even people not using the full 4-level support. But I have not barked at you about this, I undersand why it's needed. And yet you're using lmbench cycle counting to justify your position against this new verification scheme. And it's not just a sparc64 issue. Sparc64 hardware traps the access, but it's a bug regardless of platform to try to do user accesses whilst get_fs()==KERNEL_DS. All the user has to do is pass in a valid kernel address and you have a root exploit. I mean, do folks really disagree with this? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-06 22:31 ` David S. Miller @ 2005-02-07 8:11 ` Andi Kleen 2005-02-07 19:28 ` David S. Miller 0 siblings, 1 reply; 27+ messages in thread From: Andi Kleen @ 2005-02-07 8:11 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, rmk, torvalds, akpm, dwmw2, linux-arch On Sun, Feb 06, 2005 at 02:31:10PM -0800, David S. Miller wrote: > This is kind of strange to be arguing about, given that we just > put 4-level page tables into the tree, right? That regressed > everybody performance wise, even people not using the full Hmm, are you sure? I benchmarked i386 and there was no measurable difference. x86-64 is regressed right now for fork etc., but that is to be expected because it walks twice as many page tables. And there is a plan to fix it. Text size has increased a little bit for all people, but not too bad. My benchmarking was with the PML4 patchkit though, I haven't retried it with PUD. > And it's not just a sparc64 issue. Sparc64 hardware traps the > access, but it's a bug regardless of platform to try to do user > accesses whilst get_fs()==KERNEL_DS. All the user has to do is On x86-64 it's not a bug and we have code in the x86-64 specific compat layer that relies on this. Also BTW your proposed check wouldn't catch all cases anyways, it would only handle the case where the access_ok() check is done inside the KERNEL_DS. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-07 8:11 ` Andi Kleen @ 2005-02-07 19:28 ` David S. Miller 2005-02-07 20:15 ` Andi Kleen 0 siblings, 1 reply; 27+ messages in thread From: David S. Miller @ 2005-02-07 19:28 UTC (permalink / raw) To: Andi Kleen; +Cc: rmk, torvalds, akpm, dwmw2, linux-arch On Mon, 7 Feb 2005 09:11:06 +0100 Andi Kleen <ak@suse.de> wrote: > Also BTW your proposed check wouldn't catch all cases anyways, > it would only handle the case where the access_ok() check is > done inside the KERNEL_DS. There must be a verify area done (either via explicit call or via the user access macros which do not have the "__" prefix) for each range of "userspace" accesses done through the uaccess.h accessors. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-07 19:28 ` David S. Miller @ 2005-02-07 20:15 ` Andi Kleen 2005-02-07 20:13 ` David S. Miller 0 siblings, 1 reply; 27+ messages in thread From: Andi Kleen @ 2005-02-07 20:15 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, rmk, torvalds, akpm, dwmw2, linux-arch On Mon, Feb 07, 2005 at 11:28:37AM -0800, David S. Miller wrote: > On Mon, 7 Feb 2005 09:11:06 +0100 > Andi Kleen <ak@suse.de> wrote: > > > Also BTW your proposed check wouldn't catch all cases anyways, > > it would only handle the case where the access_ok() check is > > done inside the KERNEL_DS. > > There must be a verify area done (either via explicit call or > via the user access macros which do not have the "__" prefix) > for each range of "userspace" accesses done through the uaccess.h accessors. Yes, but it is done outside KERNEL_DS (otherwise it is a security hole) And then later in KERNEL_DS there is no verify_area. So this change wouldn't help you at all because it would rarely if ever catch any bugs. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-07 20:15 ` Andi Kleen @ 2005-02-07 20:13 ` David S. Miller 0 siblings, 0 replies; 27+ messages in thread From: David S. Miller @ 2005-02-07 20:13 UTC (permalink / raw) To: Andi Kleen; +Cc: rmk, torvalds, akpm, dwmw2, linux-arch On Mon, 7 Feb 2005 21:15:47 +0100 Andi Kleen <ak@suse.de> wrote: > Yes, but it is done outside KERNEL_DS (otherwise it is a security hole) > And then later in KERNEL_DS there is no verify_area. It is done "inside" KERNEL_DS by the routines we invoke which expect user pointers but we're giving them kernel pointers. Example: extern long sys_foo(char __user *buf, int len); long compat_sys_foo(compat_uptr_t ubuf, int len) { char *kbuf = kmalloc(len, GFP_KERNEL); mm_segment_t old_fs = get_fs(); int err; if (!kbuf) return -ENOMEM; set_fs(KERNEL_DS); err = sys_foo(kbuf, len); set_fs(old_fs); kfree(kbuf); return err; } The copy_to_user() or whatever done by sys_foo() will operate within KERNEL_DS on "kbuf" and thus the access_ok() check done via copy_to_user() will do the proper checks for us with my proposal of valid virtual address ranges stored in the mm_struct. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 2:16 ` [patch 19/24] TASK_SIZE is variable Linus Torvalds 2005-02-05 3:29 ` Linus Torvalds @ 2005-02-05 6:54 ` Andi Kleen 2005-02-05 7:18 ` Andrew Morton 1 sibling, 1 reply; 27+ messages in thread From: Andi Kleen @ 2005-02-05 6:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, dwmw2, Linux Arch list, mingo > Would arch maintainers please look at fixing this in their architectures > instead? Make TASK_SIZE be the maximum size a process VM can have. The problem is that compat flexmmap depends on it. I actually dropped flexmmap again, but the x86-64 flexmmap which will be merged at some point again had a x86-64 change to switch to variable TASK_SIZE. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 6:54 ` Andi Kleen @ 2005-02-05 7:18 ` Andrew Morton 2005-02-05 7:40 ` Andi Kleen 2005-02-05 23:15 ` David S. Miller 0 siblings, 2 replies; 27+ messages in thread From: Andrew Morton @ 2005-02-05 7:18 UTC (permalink / raw) To: Andi Kleen; +Cc: torvalds, dwmw2, linux-arch, mingo Andi Kleen <ak@suse.de> wrote: > > > Would arch maintainers please look at fixing this in their architectures > > instead? Make TASK_SIZE be the maximum size a process VM can have. > > The problem is that compat flexmmap depends on it. I actually dropped flexmmap > again, but the x86-64 flexmmap which will be merged at some point again > had a x86-64 change to switch to variable TASK_SIZE. TASK_SIZE is a property of the mm_struct, is it not? Why can't add mm_struct.task_size and kill off TASK_SIZE altogether? And MM_VM_SIZE? Stuff like ia64's #define TASK_SIZE (current->thread.task_size) just seems dead wrong to me, in the case where a random /proc reader is the last one left holding a reference to the mm. Paul's patch seems disjoint from the above - it adds task_size.max_addr which records the highest virtual address which the mm has mapped. Nice for optimising various teardown things, but not suitable for TASK_SIZE elimination. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 7:18 ` Andrew Morton @ 2005-02-05 7:40 ` Andi Kleen 2005-02-05 23:27 ` David S. Miller 2005-02-05 23:15 ` David S. Miller 1 sibling, 1 reply; 27+ messages in thread From: Andi Kleen @ 2005-02-05 7:40 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, torvalds, dwmw2, linux-arch, mingo On Fri, Feb 04, 2005 at 11:18:16PM -0800, Andrew Morton wrote: > Andi Kleen <ak@suse.de> wrote: > > > > > Would arch maintainers please look at fixing this in their architectures > > > instead? Make TASK_SIZE be the maximum size a process VM can have. > > > > The problem is that compat flexmmap depends on it. I actually dropped flexmmap > > again, but the x86-64 flexmmap which will be merged at some point again > > had a x86-64 change to switch to variable TASK_SIZE. > > TASK_SIZE is a property of the mm_struct, is it not? Why can't add > mm_struct.task_size and kill off TASK_SIZE altogether? And MM_VM_SIZE? Probably yes. I wouldn't combine it with set_fs like Linus proposed though, because that could lead to mistakes in set_fs(KERNEL_DS) -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 7:40 ` Andi Kleen @ 2005-02-05 23:27 ` David S. Miller 2005-02-06 10:38 ` Andi Kleen 2005-02-06 13:05 ` Matthew Wilcox 0 siblings, 2 replies; 27+ messages in thread From: David S. Miller @ 2005-02-05 23:27 UTC (permalink / raw) To: Andi Kleen; +Cc: akpm, torvalds, dwmw2, linux-arch, mingo On Sat, 5 Feb 2005 08:40:25 +0100 Andi Kleen <ak@suse.de> wrote: > I wouldn't combine it with set_fs like Linus proposed though, because > that could lead to mistakes in set_fs(KERNEL_DS) That's a good point. However, from another perspective it does make some sense. I only know of a cpu type or two that really provide strong address space protection for accidental supervisor mode accesses to user space. I know sparc64 provides this for sure, and I think parisc has something similar as well. Andi is very familiar with this, because some compat layer code written and tested on x86_64 fails on sparc64 because the latter disallows the following: extern int sys_foo(const char __user *buf, unsigned long __user *ret_val); int compat_sys_foo(compat_uptr_t u_buf, compat_uptr_t u_ret_val) { const char __user *buf = compat_ptr(u_buf); unsigned long k_val; mm_segment_t old_fs = get_fs(); int err; set_fs(KERNEL_DS); err = sys_foo(buf, (unsigned long __user *) &k_val); ... This does not fault on x86_64, but it does on platforms like sparc64. Even though it doesn't fault on x86_64, it's a security hole because it allows the user to pass in kernel addresses, and such kernel addresses will just work since we're in KERNEL_DS. If set_fs() updated some mm->max_addr thing, access_ok() and friends would trap things like this in software even on x86_64. Therefore, I think if anything it's a very good bug check. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 23:27 ` David S. Miller @ 2005-02-06 10:38 ` Andi Kleen 2005-02-06 13:05 ` Matthew Wilcox 1 sibling, 0 replies; 27+ messages in thread From: Andi Kleen @ 2005-02-06 10:38 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, akpm, torvalds, dwmw2, linux-arch, mingo > int compat_sys_foo(compat_uptr_t u_buf, compat_uptr_t u_ret_val) > { > const char __user *buf = compat_ptr(u_buf); > unsigned long k_val; > mm_segment_t old_fs = get_fs(); > int err; > > set_fs(KERNEL_DS); > err = sys_foo(buf, (unsigned long __user *) &k_val); > ... > > This does not fault on x86_64, but it does on platforms like sparc64. Actually it faults on UML/x86-64 too :) > Even though it doesn't fault on x86_64, it's a security hole because it > allows the user to pass in kernel addresses, and such kernel addresses > will just work since we're in KERNEL_DS. the caller just has to verify_area() everything. Not doing that would be a security hole yes. One of the reason I think it's a good idea to discourage because driver writers often get this detail wrong. But even with all that compat code going away set_fs will stay: there are places like network IO in kernel etc. where there is just no way around it. > If set_fs() updated some mm->max_addr thing, access_ok() and friends > would trap things like this in software even on x86_64. Therefore, > I think if anything it's a very good bug check. Hmm, I don't see what change it would make. Currently in KERNEL_DS access_ok is always true. I don't see how you can change this without breaking everything? In theory you could make it check for user space addresses and then fail on i386/x86-64 ((addr) >= TASK_SIZE), but that would bloat the code generated by this common macro a lot and it's probably not worth it. But mm->max_addr wouldn't help you with this at all, you would need a new mm->min_addr which I didn't think anybody was proposing. -Andi ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 23:27 ` David S. Miller 2005-02-06 10:38 ` Andi Kleen @ 2005-02-06 13:05 ` Matthew Wilcox 1 sibling, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2005-02-06 13:05 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, akpm, torvalds, dwmw2, linux-arch, mingo On Sat, Feb 05, 2005 at 03:27:52PM -0800, David S. Miller wrote: > However, from another perspective it does make some sense. > I only know of a cpu type or two that really provide strong > address space protection for accidental supervisor mode accesses > to user space. I know sparc64 provides this for sure, and I > think parisc has something similar as well. Yes, we essentially have a 4G/4G split on PA. We rather abuse addr_limit though. I'm not particularly happy with the current uaccess code and it's something I plan to clean up long-term. -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 19/24] TASK_SIZE is variable. 2005-02-05 7:18 ` Andrew Morton 2005-02-05 7:40 ` Andi Kleen @ 2005-02-05 23:15 ` David S. Miller 1 sibling, 0 replies; 27+ messages in thread From: David S. Miller @ 2005-02-05 23:15 UTC (permalink / raw) To: Andrew Morton; +Cc: ak, torvalds, dwmw2, linux-arch, mingo On Fri, 4 Feb 2005 23:18:16 -0800 Andrew Morton <akpm@osdl.org> wrote: > TASK_SIZE is a property of the mm_struct, is it not? I agree. This is being restated several times, and hope that there is some kind of consensus on this issue at this point. ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2005-02-08 19:09 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <200502050150.j151osl11380@mail.osdl.org>
2005-02-05 2:16 ` [patch 19/24] TASK_SIZE is variable Linus Torvalds
2005-02-05 3:29 ` Linus Torvalds
2005-02-05 5:52 ` David S. Miller
2005-02-07 10:59 ` David Howells
2005-02-07 19:30 ` David S. Miller
2005-02-08 9:05 ` Martin Schwidefsky
2005-02-08 19:09 ` David S. Miller
2005-02-05 9:06 ` Russell King
2005-02-05 23:44 ` David S. Miller
2005-02-06 10:50 ` Andi Kleen
2005-02-06 21:19 ` David S. Miller
2005-02-06 21:31 ` Andi Kleen
2005-02-06 21:31 ` David S. Miller
2005-02-06 21:50 ` Andi Kleen
2005-02-06 22:25 ` David S. Miller
2005-02-06 22:31 ` David S. Miller
2005-02-07 8:11 ` Andi Kleen
2005-02-07 19:28 ` David S. Miller
2005-02-07 20:15 ` Andi Kleen
2005-02-07 20:13 ` David S. Miller
2005-02-05 6:54 ` Andi Kleen
2005-02-05 7:18 ` Andrew Morton
2005-02-05 7:40 ` Andi Kleen
2005-02-05 23:27 ` David S. Miller
2005-02-06 10:38 ` Andi Kleen
2005-02-06 13:05 ` Matthew Wilcox
2005-02-05 23:15 ` David S. Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox