* execve(NULL, argv, envp) for nommu?
@ 2017-09-05 7:34 Rob Landley
  2017-09-05 9:00 ` Geert Uytterhoeven
  0 siblings, 1 reply; 13+ messages in thread
From: Rob Landley @ 2017-09-05 7:34 UTC (permalink / raw)
  To: linux-embedded

For years I've wanted an execve() system call modification that let me
pass a NULL as the first argument to say "re-exec this program please".
Because on nommu you've got to exec something to unblock vfork(), and
daemons (or things like busybox and toybox) want to re-exec themselves.
I just hit this again trying to implement a nommu-friendly strace(): the
one on github doesn't SIGSTOP the child before the execve() of the
process to trace because vfork(), and just races and misses the first
few system calls on nommu instead...

The problem with exec /proc/self/exe is A) I haven't necessarily got
/proc mounted, B) in a chroot the original binary might not be in scope
anymore. But I'm already _running_ this program. If I could fork() I
could already get a second copy of the sucker and call main() again
myself if necessary, but I can't, so...

I'm aware there's a possible "but what if it was suid and it's already
dropped privileges" argument, and I'm fine with execve(NULL) not
honoring the suid bit if people feel that way. I just wanna unblock
vfork() while still running this code. (A way to detect I did this would
be great too, but the normal tweaking of argv[] or envp[] to let main
know we're a child still works.)

Is there a _reason_ the kernel doesn't do this, or has nobody bothered
to code it up yet?

Rob

^ permalink raw reply [flat|nested] 13+ messages in thread
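For readers who have not used the pattern, here is a minimal sketch of the vfork()-then-exec workaround the message above describes. The helper name `vfork_exec` is hypothetical and not from the thread; the only calls used are the standard vfork(), execve() and _exit(). Passing "/proc/self/exe" as the path gives the re-exec case, with exactly the limitations listed above (no /proc mounted, binary out of scope in a chroot).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: start path/argv in a child created with vfork().
 * Returns the child's pid in the parent, or -1 if vfork() failed. */
static pid_t vfork_exec(const char *path, char *const argv[])
{
	pid_t pid;

	assert(path != NULL && argv != NULL);
	pid = vfork();
	if (pid == 0) {
		/* Child: still borrowing the parent's memory, so do nothing
		 * here except exec or _exit. */
		execve(path, argv, environ);
		_exit(127);	/* exec failed, e.g. path not reachable */
	}
	/* Parent: only resumes once the child has exec'd or exited. */
	return pid;
}
```

A self re-exec would call `vfork_exec("/proc/self/exe", argv)` with some marker in argv[] or the environment so that main() can tell it is the child, as the message suggests.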
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-05 7:34 execve(NULL, argv, envp) for nommu? Rob Landley
@ 2017-09-05 9:00 ` Geert Uytterhoeven
  2017-09-05 13:24 ` Alan Cox
  0 siblings, 1 reply; 13+ messages in thread
From: Geert Uytterhoeven @ 2017-09-05 9:00 UTC (permalink / raw)
  To: Rob Landley; +Cc: Linux Embedded, Oleg Nesterov, linux-kernel@vger.kernel.org

CC Oleg, lkml

On Tue, Sep 5, 2017 at 9:34 AM, Rob Landley <rob@landley.net> wrote:
> For years I've wanted an execve() system call modification that let me
> pass a NULL as the first argument to say "re-exec this program please".
> Because on nommu you've got to exec something to unblock vfork(), and
> daemons (or things like busybox and toybox) want to re-exec themselves.
> I just hit this again trying to implement a nommu-friendly strace(): the
> one on github doesn't SIGSTOP the child before the execve() of the
> process to trace because vfork(), and just races and misses the first
> few system calls on nommu instead...
>
> The problem with exec /proc/self/exe is A) I haven't necessarily got
> /proc mounted, B) in a chroot the original binary might not be in scope
> anymore. But I'm already _running_ this program. If I could fork() I
> could already get a second copy of the sucker and call main() again
> myself if necessary, but I can't, so...
>
> I'm aware there's a possible "but what if it was suid and it's already
> dropped privileges" argument, and I'm fine with execve(NULL) not
> honoring the suid bit if people feel that way. I just wanna unblock
> vfork() while still running this code. (A way to detect I did this would
> be great too, but the normal tweaking of argv[] or envp[] to let main
> know we're a child still works.)
>
> Is there a _reason_ the kernel doesn't do this, or has nobody bothered
> to code it up yet?
>
> Rob
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-05 9:00 ` Geert Uytterhoeven
@ 2017-09-05 13:24 ` Alan Cox
  2017-09-06 1:12 ` Rob Landley
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Cox @ 2017-09-05 13:24 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Rob Landley, Linux Embedded, Oleg Nesterov, linux-kernel@vger.kernel.org

> > anymore. But I'm already _running_ this program. If I could fork() I
> > could already get a second copy of the sucker and call main() again
> > myself if necessary, but I can't, so...

You can - ptrace 8)

> > honoring the suid bit if people feel that way. I just wanna unblock
> > vfork() while still running this code.

Would it make more sense to have a way to promote your vfork into a
fork when you hit these cases (I appreciate that fork on NOMMU has a much
higher performance cost as you start having to softmmu copy or swap
pages).

Alan
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-05 13:24 ` Alan Cox
@ 2017-09-06 1:12 ` Rob Landley
  2017-09-08 21:18 ` Rob Landley
  2017-09-11 18:14 ` Alan Cox
  0 siblings, 2 replies; 13+ messages in thread
From: Rob Landley @ 2017-09-06 1:12 UTC (permalink / raw)
  To: Alan Cox, Geert Uytterhoeven
  Cc: Linux Embedded, Oleg Nesterov, dalias, linux-kernel@vger.kernel.org

On 09/05/2017 08:24 AM, Alan Cox wrote:
>>> anymore. But I'm already _running_ this program. If I could fork() I
>>> could already get a second copy of the sucker and call main() again
>>> myself if necessary, but I can't, so...
>
> You can - ptrace 8)

Oh I can call clone() with various flags and try to fake it myself, it
just won't do what I want. :)

>>> honoring the suid bit if people feel that way. I just wanna unblock
>>> vfork() while still running this code.
>
> Would it make more sense to have a way to promote your vfork into a
> fork when you hit these cases (I appreciate that fork on NOMMU has a much
> higher performance cost as you start having to softmmu copy or swap
> pages).

It's not the performance cost, it's rewriting all the pointers.

Without address translation, copying the existing mappings to a new
range requires finding and adjusting every pointer to the old data,
which you can do for the executable mappings in PIE* binaries, but
tracking down all the pointers on the stack, heap, and in your global
variables? Flaming pain.

Making fork() work on nommu is basically the same problem as making
garbage collection work in C on mmu. Thus those of us who defend vfork()
from the people who don't understand why it exists periodically
suggesting we remove it.

> Alan

Rob

* or FDPIC, which is basically just PIE with 4 individually relocatable
text/data/rodata/bss segments instead of one big mapping you relocate as
a contiguous block; both work on nommu but fdpic can fit into more
fragmented memory, and because the segments are independent it lets
nommu share some segments between processes (code+rodata**) without
sharing others (data and bss). That's why nommu can't run normal elf but
can run PIE or FDPIC binaries. Or binflt which is the old a.out version.

** Don't ask me what happens when rodata contains a constant pointer to
a bss or data object. I'm guessing the compiler Does A Thing. Ask Rich
Felker?
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-06 1:12 ` Rob Landley
@ 2017-09-08 21:18 ` Rob Landley
  2017-09-11 15:15 ` Oleg Nesterov
  2017-09-11 18:14 ` Alan Cox
  1 sibling, 1 reply; 13+ messages in thread
From: Rob Landley @ 2017-09-08 21:18 UTC (permalink / raw)
  To: Alan Cox, Geert Uytterhoeven
  Cc: Linux Embedded, Oleg Nesterov, dalias, linux-kernel@vger.kernel.org

On 09/05/2017 08:12 PM, Rob Landley wrote:
> On 09/05/2017 08:24 AM, Alan Cox wrote:
>>>> honoring the suid bit if people feel that way. I just wanna unblock
>>>> vfork() while still running this code.
>>
>> Would it make more sense to have a way to promote your vfork into a
>> fork when you hit these cases (I appreciate that fork on NOMMU has a much
>> higher performance cost as you start having to softmmu copy or swap
>> pages).
>
> It's not the performance cost, it's rewriting all the pointers.
>
> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,
> which you can do for the executable mappings in PIE* binaries, but
> tracking down all the pointers on the stack, heap, and in your global
> variables? Flaming pain.
>
> Making fork() work on nommu is basically the same problem as making
> garbage collection work in C on mmu. Thus those of us who defend vfork()
> from the people who don't understand why it exists periodically
> suggesting we remove it.

So is exec(NULL, argv, envp) a reasonable thing to want?

Rob
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-08 21:18 ` Rob Landley
@ 2017-09-11 15:15 ` Oleg Nesterov
  2017-09-12 10:48 ` Rob Landley
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2017-09-11 15:15 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel@vger.kernel.org

On 09/08, Rob Landley wrote:
>
> So is exec(NULL, argv, envp) a reasonable thing to want?

I think that something like prctl(PR_OPEN_EXE_FILE) which does

	dentry_open(current->mm->exe_file->path, O_PATH)

and returns fd make more sense.

Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

But to be honest, I can't understand the problem, because I know nothing
about nommu.

You need to unblock parent sleeping in vfork(), and you can't do another
fork (I don't understand why).

Perhaps the child can create another thread? The main thread can exit
after that and unblock the parent. Or perhaps even something like
clone(CLONE_VM | CLONE_PARENT), I dunno...

Oleg.
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-11 15:15 ` Oleg Nesterov
@ 2017-09-12 10:48 ` Rob Landley
  2017-09-12 11:30 ` Geert Uytterhoeven
  2017-09-12 15:45 ` Oleg Nesterov
  0 siblings, 2 replies; 13+ messages in thread
From: Rob Landley @ 2017-09-12 10:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel@vger.kernel.org

On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> On 09/08, Rob Landley wrote:
>>
>> So is exec(NULL, argv, envp) a reasonable thing to want?
>
> I think that something like prctl(PR_OPEN_EXE_FILE) which does
>
> 	dentry_open(current->mm->exe_file->path, O_PATH)
>
> and returns fd make more sense.
>
> Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

I'm all for it? That sounds like a cosmetic difference, a more verbose
way of achieving the same outcome.

(Of course now you've got a filehandle you can read xattrs and such
through from otherwise jailed contexts letting you do things you
couldn't necessarily do before, but I assume you know the security
implications of that more than I do. I tried to suggest something that
_didn't_ create new capabilities, just let nommu do a thing that mmu
could already do.)

> But to be honest, I can't understand the problem, because I know nothing
> about nommu.
>
> You need to unblock parent sleeping in vfork(), and you can't do another
> fork (I don't understand why).

A nommu system doesn't have a memory management unit, so all addresses
are physical addresses. This means two processes can't see different
things at the same address: either they see the same thing or one of
them can't see that address (due to a range register making it).

Conventional fork() creates copy on write mappings of all the existing
writable memory of the parent process. So when the new PID dirties a
page, the old page gets copied by the fault handler. The problem isn't
the copies (that's just slow), the problem is two processes seeing
different things at the same address. That requires an MMU with a TLB
loaded from page tables.

If you create _new_ mappings and copy the data over, they'll have
different addresses. But any pointers you copied will point to the _old_
addresses. Finding and adjusting all those pointers to point to the new
addresses instead is basically the same problem as doing garbage
collection in C.

Your stack has pointers. Your heap has pointers. Your data and bss (once
initialized) can have pointers. These pointers can be in the middle of
malloc()'ed structures so no ELF table anywhere knows anything about
them. A long variable containing a value that _could_ point into one of
these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
it is breakage. Tracking them all down and fixing up just the right ones
without missing any or changing data you shouldn't is REALLY HARD.

The vfork() system call is what you use on nommu instead: it creates a
child process that uses its parent's memory mappings. The parent process
is stopped until the child calls _exit() or exec(), either of which
means it stops using those mappings and the parent can go back to using
them without the two stomping on each other.

(Usually they even share the same stack, so the child shouldn't return
from the function that called vfork() or it'll corrupt the stack for the
parent process. And be careful about changing local variables, the
parent might see the changes when it resumes. Some vfork()
implementations provide a small new stack, ala signal handlers or kernel
interrupts, so you can't guarantee your parent will see your local
variable changes, but you still can't return from the function that
called vfork() in either case.)

So after calling vfork(), the child _must_ call exec() in order for
there to be two independent processes running at the same time. Until
then, the parent is stopped.

The real problem with implementing full fork() isn't the expense of
copying the data (although if you fork and exec from a mozilla style pig
process, you could copy hundreds of megabytes of data and then
immediately discard it again; that's why fork() doesn't usually do that;
oh and on nommu systems you need _contiguous_ memory blocks for the data
because it can't collect disparate pages together into a longer mapping,
so this is actually a largeish real-world issue on those systems, not
merely slow and expensive.) The hard problem is translating the pointers
so the new mapping doesn't read/write objects in the old mapping.

> Perhaps the child can create another thread? The main thread can exit
> after that and unblock the parent. Or perhaps even something like
> clone(CLONE_VM | CLONE_PARENT), I dunno...

Launching a new thread doesn't unblock the parent. A second vfork() from
the child wouldn't unblock the parent. Your mappings are still
overcommitted, only _exit() or execve() releases the child process's use
of those mappings.

You can create threads on nommu because they're designed to share the
same mappings. In that case you're guaranteed a new stack, and not
stomping the parent's data is your problem. But if you exec() from a
thread, posix says it kills all the other threads:

  http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

And even without that, we're still in the "vfork but add concurrency"
territory. Your threads don't have their own independent mappings,
they're sharing and stomping each other's data unless you add locking
and write your program to know about the other threads.

To get two independent process contexts running the same executable but
with different mappings (I.E. the goal we started with), you still need
the child to exec. And the start of this thread was "exec what"?

> Oleg.

Rob
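The pointer problem described above can be reproduced in ordinary userspace C. The following is only a toy illustration under stated assumptions: one malloc()ed block stands in for a process's writable mapping, and the names `make_block`, `copy_block` and `points_into` are hypothetical, not from the thread. After a byte-for-byte copy to a new address, the pointer stored inside the copy still refers to the old block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct node {
	struct node *next;	/* absolute address stored in memory */
	int value;
};

/* Build a two-node list inside one malloc()ed block, standing in for a
 * process's writable mapping. */
static struct node *make_block(void)
{
	struct node *blk = malloc(2 * sizeof(*blk));

	assert(blk != NULL);
	blk[0].next = &blk[1];
	blk[0].value = 1;
	blk[1].next = NULL;
	blk[1].value = 2;
	return blk;
}

/* What a naive nommu fork() would do: copy the bytes to a new address. */
static struct node *copy_block(const struct node *old)
{
	struct node *copy = malloc(2 * sizeof(*copy));

	assert(copy != NULL);
	memcpy(copy, old, 2 * sizeof(*copy));
	return copy;
}

/* Returns 1 if p points inside the two-node block starting at blk. */
static int points_into(const struct node *blk, const struct node *p)
{
	uintptr_t a = (uintptr_t)p, lo = (uintptr_t)blk;

	return a >= lo && a < lo + 2 * sizeof(*blk);
}
```

With `parent = make_block()` and `child = copy_block(parent)`, `child[0].next` still points into `parent`, so a write through it modifies the parent's data. Fixing that up in general means knowing which words in the copy are pointers, which is the hard part the message describes.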
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 10:48 ` Rob Landley
@ 2017-09-12 11:30 ` Geert Uytterhoeven
  2017-09-12 13:45 ` Rob Landley
  2017-09-12 15:45 ` Oleg Nesterov
  1 sibling, 1 reply; 13+ messages in thread
From: Geert Uytterhoeven @ 2017-09-12 11:30 UTC (permalink / raw)
  To: Rob Landley
  Cc: Oleg Nesterov, Alan Cox, Linux Embedded, Rich Felker, linux-kernel@vger.kernel.org

Hi Rob,

On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <rob@landley.net> wrote:
> A nommu system doesn't have a memory management unit, so all addresses
> are physical addresses. This means two processes can't see different
> things at the same address: either they see the same thing or one of
> them can't see that address (due to a range register making it).
>
> Conventional fork() creates copy on write mappings of all the existing
> writable memory of the parent process. So when the new PID dirties a
> page, the old page gets copied by the fault handler. The problem isn't
> the copies (that's just slow), the problem is two processes seeing
> different things at the same address. That requires an MMU with a TLB
> loaded from page tables.
>
> If you create _new_ mappings and copy the data over, they'll have
> different addresses. But any pointers you copied will point to the _old_
> addresses. Finding and adjusting all those pointers to point to the new
> addresses instead is basically the same problem as doing garbage
> collection in C.
>
> Your stack has pointers. Your heap has pointers. Your data and bss (once
> initialized) can have pointers. These pointers can be in the middle of
> malloc()'ed structures so no ELF table anywhere knows anything about
> them. A long variable containing a value that _could_ point into one of
> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
> it is breakage. Tracking them all down and fixing up just the right ones
> without missing any or changing data you shouldn't is REALLY HARD.

Hence (make the compiler) never store pointers, only offsets relative to a
base register. So after making copies of stack, data/bss, and heap, all you
need to do is adjust these base registers for the child process.
Nothing in main memory needs to be modified.

Text accesses can be PC-relative => nothing to adjust.
Local variable accesses are stack-relative => nothing to adjust.
Data/bss accesses can be relative to a reserved register that stores the
data base address => only adjust the base register, nothing in RAM to adjust.
Heap accesses can be relative to a reserved register that stores the heap
base address => only adjust the base register, nothing in RAM to adjust.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
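The base-relative idea above can be tried by hand in plain C, without any compiler support. This is a minimal sketch under stated assumptions: an explicit `base` argument plays the role of the reserved base register, a single calloc()ed arena plays the role of the heap, and all names (`rel_node`, `at`, `arena_new`, `arena_clone`, `build`, `sum`) are hypothetical. It shows only the heap case; it does not address the objections raised elsewhere in the thread about mixing pointers from different segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define ARENA_SIZE 256
#define NIL ((size_t)-1)

struct rel_node {
	size_t next;	/* offset from the arena base, not an address */
	int value;
};

/* Resolve an offset against a base; stands in for the base register. */
static struct rel_node *at(unsigned char *base, size_t off)
{
	return off == NIL ? NULL : (struct rel_node *)(base + off);
}

/* Allocate a zeroed arena (malloc-family memory is suitably aligned). */
static unsigned char *arena_new(void)
{
	unsigned char *base = calloc(1, ARENA_SIZE);

	assert(base != NULL);
	return base;
}

/* "fork" the arena: copy the bytes, adjust nothing inside them. */
static unsigned char *arena_clone(const unsigned char *base)
{
	unsigned char *copy = arena_new();

	memcpy(copy, base, ARENA_SIZE);
	return copy;
}

/* Lay out a two-node list at offsets 0 and sizeof(struct rel_node). */
static void build(unsigned char *base)
{
	struct rel_node *a = at(base, 0);
	struct rel_node *b = at(base, sizeof(*a));

	a->next = sizeof(*a);
	a->value = 1;
	b->next = NIL;
	b->value = 2;
}

/* Sum the list reachable from offset 0 of the given arena. */
static int sum(unsigned char *base)
{
	int total = 0;

	for (struct rel_node *n = at(base, 0); n; n = at(base, n->next))
		total += n->value;
	return total;
}
```

Because only offsets are stored, walking the clone never touches the original arena; the cost is that every access has to go through `at()` with the right base.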
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 11:30 ` Geert Uytterhoeven
@ 2017-09-12 13:45 ` Rob Landley
  2017-09-13 19:33 ` Alan Cox
  0 siblings, 1 reply; 13+ messages in thread
From: Rob Landley @ 2017-09-12 13:45 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Oleg Nesterov, Alan Cox, Linux Embedded, Rich Felker, linux-kernel@vger.kernel.org

On 09/12/2017 06:30 AM, Geert Uytterhoeven wrote:
> Hi Rob,
>
> On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <rob@landley.net> wrote:
>> Your stack has pointers. Your heap has pointers. Your data and bss (once
>> initialized) can have pointers. These pointers can be in the middle of
>> malloc()'ed structures so no ELF table anywhere knows anything about
>> them. A long variable containing a value that _could_ point into one of
>> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
>> it is breakage. Tracking them all down and fixing up just the right ones
>> without missing any or changing data you shouldn't is REALLY HARD.
>
> Hence (make the compiler) never store pointers, only offsets relative to a
> base register. So after making copies of stack, data/bss, and heap, all you
> need to do is adjust these base registers for the child process.
> Nothing in main memory needs to be modified.

Ok, I'll bite. How do you set a signal handler under this regime, since
that needs to pass a function pointer to the syscall? Have a different
function pointer type for when you want a real pointer instead of an
offset pointer? Perhaps label them "near" and "far" pointers, since
there's precedent for that back under DOS?

When you call printf(), how does it accept both a "string constant"
living in rodata and a char array on the stack? Two printf functions
with different argument types? If it _does_ take an actual memory
address rather than an offset that isn't always vs the same segment then
you've written pointers to the stack...

You're also requiring static linking: shared libraries work just fine
with fdpic, but under your segment:offset addressing system all text has
to be relative to the same code segment.

Plus there's still the "fork() off of mozilla" problem that you may copy
lots of data just to immediately discard it as the common case (unless
you'd still use vfork() for most things), and you still need contiguous
blocks of memory for each segment (nommu is vulnerable to fragmentation,
increasingly so as the system stays up longer) so your fork() will fail
where vfork() succeeds. But that just makes it really slow and
unreliable, rather than requiring a large rewrite of the C language.

> Text accesses can be PC-relative => nothing to adjust.
> Local variable accesses are stack-relative => nothing to adjust.
> Data/bss accesses can be relative to a reserved register that stores the
> data base address => only adjust the base register, nothing in RAM to adjust.

Does this compiler setup you're describing actually exist? Instead of
making a minor adjustment to one system call, it's better to extensively
rewrite compilers and calling conventions, ignoring the way C
traditionally treats strings and arrays as pointers where pointers into
data, bss, heap, and stack are all used interchangeably...

> Heap accesses can be relative to a reserved register that stores the heap
> base address => only adjust the base register, nothing in RAM to adjust.

Query: if you implement a linked list ala:

  struct blah {
    struct blah *next;
    char *key, *value;
  };

If next points to a malloc(), key is a constant string in rodata, and
value was strchr(getenv(key), '=')+1 (with appropriate error checking of
course), how does your compiler know which segment each pointer in that
structure is offset from? (What segment IS your environment space
relative to, anyway? It's not the _current_ value of your stack pointer,
that moves.)

How does your proposed compiler rewrite handle mmap()? You can do
MAP_SHARED just fine on nommu today, it's only MAP_PRIVATE that requires
copy on write. (Yes MAP_SHARED can be read only.)

You're aware that most heap implementations can have more than one
underlying mmap(), right?

  http://git.musl-libc.org/cgit/musl/tree/src/malloc/malloc.c#n320
  https://github.com/kraj/uClibc/blob/master/libc/stdlib/malloc/malloc.c#L121

So when you say _the_ heap base address above, which chunk are you
referring to?

Rob
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 13:45 ` Rob Landley
@ 2017-09-13 19:33 ` Alan Cox
  0 siblings, 0 replies; 13+ messages in thread
From: Alan Cox @ 2017-09-13 19:33 UTC (permalink / raw)
  To: Rob Landley
  Cc: Geert Uytterhoeven, Oleg Nesterov, Linux Embedded, Rich Felker, linux-kernel@vger.kernel.org

> Ok, I'll bite. How do you set a signal handler under this regime, since
> that needs to pass a function pointer to the syscall? Have a different
> function pointer type for when you want a real pointer instead of an
> offset pointer? Perhaps label them "near" and "far" pointers, since
> there's precedent for that back under DOS?

A function pointer is an offset relative to the base of the code (but the
other comments are mostly valid)

For most hardware it's cheaper to just do it the way Minix did,
especially as all the hard work in being able to share code and
copy/migrate data happens to have been done in order to make XIP work. A
modern CPU can copy memory a lot faster than an 8MHZ 68K which couldn't
even manage to move 16bits/clock.

> You're also requiring static linking: shared libraries work just fine
> with fdpic, but under your segment:offset addressing system all text has
> to be relative to the same code segment.

No - see the Windows 16bit approach to this. Bring a bucket though 8)

> Plus there's still the "fork() off of mozilla" problem that you may copy
> lots of data just to immediately discard it as the common case (unless
> you'd still use vfork() for most things), and you still need contiguous
> blocks of memory for each segment (nommu is vulnerable to fragmentation,
> increasingly so as the system stays up longer) so your fork() will fail
> where vfork() succeeds. But that just makes it really slow and

If you just do copies and scheduling time swaps of memory blocks then
fragmentation isn't a problem because you can fragment the copy not
currently running.

In fact you can (as MAPUX did) extend this to completely kill the
fragmentation problem at the cost of turning sustained high memory usage
with few process deaths into very poor performance. MAPUX algorithm works
very hard to keep stuff unfragmented but is prepared to move chunks of
other processes temporarily around in order to keep the running process
where it should be. In effect it implements a software paged MMU with an
allocator that tries to achieve a 1:1 mapping of the virt/phys of the
process.

POSIX tries to side step all of this by providing a combined fork/mess
with file handles of child etc/execve function (posix_spawn) that an
MMUless system can implement to provide the usual functionalities of
fork() / execve() like handle redirection. There are also other ways to
implement that with threads not sharing file handles if you have enough
thread capability (something posix spawn can't assume).

Alan
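As a concrete reference for the posix_spawn remark above, here is a minimal sketch of "exec with handle redirection" done without an explicit fork() or vfork() in the caller. The helper name `spawn_with_stdout` is hypothetical; the calls used (posix_spawn, posix_spawn_file_actions_init, posix_spawn_file_actions_adddup2, posix_spawn_file_actions_destroy, waitpid) are the standard POSIX ones. Whether a given libc implements posix_spawn on top of vfork(), clone() or something else is not assumed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run path/argv with its stdout redirected to out_fd,
 * wait for it, and return its exit status (or -1 on any failure). */
static int spawn_with_stdout(const char *path, char *const argv[], int out_fd)
{
	posix_spawn_file_actions_t fa;
	pid_t pid;
	int status, rc;

	assert(path != NULL && argv != NULL);
	if (posix_spawn_file_actions_init(&fa) != 0)
		return -1;
	/* The redirection a fork()+dup2()+exec() sequence would have done
	 * in the child, expressed as data handed to posix_spawn(). */
	rc = posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO);
	if (rc == 0)
		rc = posix_spawn(&pid, path, &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	if (rc != 0)
		return -1;

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Note that this still needs a path to execute, so it does not by itself answer the "exec what?" question for re-executing the current program.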
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 10:48 ` Rob Landley
  2017-09-12 11:30 ` Geert Uytterhoeven
@ 2017-09-12 15:45 ` Oleg Nesterov
  2017-09-13 14:20 ` Oleg Nesterov
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2017-09-12 15:45 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel@vger.kernel.org

On 09/12, Rob Landley wrote:
>
> On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> > On 09/08, Rob Landley wrote:
> >>
> >> So is exec(NULL, argv, envp) a reasonable thing to want?
> >
> > I think that something like prctl(PR_OPEN_EXE_FILE) which does
> >
> > 	dentry_open(current->mm->exe_file->path, O_PATH)
> >
> > and returns fd make more sense.
> >
> > Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).
> I'm all for it? That sounds like a cosmetic difference, a more verbose
> way of achieving the same outcome.

Simpler to implement. Something like the (untested) patch below. Not sure
it is correct, not sure it is good idea, etc.

> (Of course now you've got a filehandle you can read xattrs and such
> through from otherwise jailed contexts letting you do things you
> couldn't necessarily do before,

I can be easily wrong, this is not my area, but afaics no. Note that you
get the FMODE_PATH file (see O_PATH), you can do almost nothing with it.

So. IIUC with this patch you can do

	fd = prctl(PR_OPEN_EXE_FILE);
	execveat(fd, "", NULL, NULL, AT_EMPTY_PATH);

and execveat should succeed even if the binary was unlinked/renamed in
between. otoh it should fail if, say, you do "chmod a-x exename" in between.

However. This won't work after chroot() so I am not sure this solves your
problems.

> but I assume you know the security
> implications of that more than I do.

Unlikely ;)

> > But to be honest, I can't understand the problem, because I know nothing
> > about nommu.
> >
> > You need to unblock parent sleeping in vfork(), and you can't do another
> > fork (I don't understand why).
>
> A nommu system doesn't have a memory management unit, so all addresses
> are physical addresses. This means two processes can't see different
> things at the same address: either they see the same thing or one of
> them can't see that address (due to a range register making it).

Yes, yes, I understand, and thanks for your detailed explanation...

> > Perhaps the child can create another thread? The main thread can exit
> > after that and unblock the parent. Or perhaps even something like
> > clone(CLONE_VM | CLONE_PARENT), I dunno...
>
> Launching a new thread doesn't unblock the parent.

Well, this doesn't really matter, but see above, the main thread can exit
after that. This should unblock the parent.

> And even without that, we're still in the "vfork but add concurrency"
> territory. Your threads don't have their own independent mappings,

Of course! Just I misinterpreted your initial email as if this is fine
for your use-case, and all you need is unblock the parent and nothing else.

Oleg.

---

--- x/kernel/sys.c
+++ x/kernel/sys.c
@@ -2183,6 +2183,40 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
 	return 1;
 }
 
+static int open_mm_exe_file(void)
+{
+	struct file *exe_file, *file;
+	struct path *path;
+	int fd = -ENOENT;
+
+	exe_file = get_mm_exe_file(current->mm);
+	if (!exe_file)
+		goto out;
+
+	path = &exe_file->f_path;
+	if (!path->dentry)
+		goto put_exe_file;
+
+	fd = get_unused_fd_flags(O_CLOEXEC); // flags?
+	if (fd < 0)
+		goto put_exe_file;
+
+	file = dentry_open(path, O_PATH, current_cred());
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		fd = PTR_ERR(file);
+		goto put_exe_file;
+	}
+
+	path_get(path);
+	fd_install(fd, file);
+
+put_exe_file:
+	fput(exe_file);
+out:
+	return fd;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2196,6 +2230,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case PR_OPEN_EXE_FILE:
+		error = open_mm_exe_file();
+		break;
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;
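The userspace half of this proposal can be sketched with calls that exist today. PR_OPEN_EXE_FILE comes from the untested patch above and is not assumed to be available, so the sketch below gets its descriptor from /proc/self/exe instead, which of course has the very limitation the thread started with. The helper names `open_self_exe` and `reexec_fd` are hypothetical; execveat is invoked through syscall() so that no particular libc wrapper is assumed. The fallback values for O_PATH and AT_EMPTY_PATH are the generic Linux UAPI ones and are only used if the libc headers hide the macros.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000	/* asm-generic value (x86, ARM, ...) */
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Stand-in for the proposed prctl(PR_OPEN_EXE_FILE): get an O_PATH
 * descriptor for the running binary. Needs /proc, unlike the proposal. */
static int open_self_exe(void)
{
	return open("/proc/self/exe", O_PATH | O_CLOEXEC);
}

/* Re-exec the program behind fd. Only returns (with -1) on failure. */
static int reexec_fd(int fd, char *const argv[], char *const envp[])
{
	assert(argv != NULL);
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}
```

In a vfork() child this would be `reexec_fd(fd, argv, environ)` followed by `_exit(127)` for the failure path, with the descriptor opened by the parent before it drops access to the binary.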
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-12 15:45 ` Oleg Nesterov
@ 2017-09-13 14:20 ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2017-09-13 14:20 UTC (permalink / raw)
  To: Rob Landley
  Cc: Alan Cox, Geert Uytterhoeven, Linux Embedded, dalias, linux-kernel@vger.kernel.org

On 09/12, Oleg Nesterov wrote:
>
> On 09/12, Rob Landley wrote:
> >
> > On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> > > On 09/08, Rob Landley wrote:
> > >>
> > >> So is exec(NULL, argv, envp) a reasonable thing to want?
> > >
> > > I think that something like prctl(PR_OPEN_EXE_FILE) which does
> > >
> > > 	dentry_open(current->mm->exe_file->path, O_PATH)
> > >
> > > and returns fd make more sense.
> > >
> > > Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).
> > I'm all for it? That sounds like a cosmetic difference, a more verbose
> > way of achieving the same outcome.
>
> Simpler to implement. Something like the (untested) patch below. Not sure
> it is correct, not sure it is good idea, etc.

OTOH... with the trivial patch below

	execveat(AT_FDCWD, "", NULL, NULL, AT_EMPTY_PATH);

should always work, even if the binary is not in scope after chroot, or if
it is no longer executable, or unlinked. But I am not sure what else should
we do to avoid the security problems.

Oleg.

--- x/fs/exec.c
+++ x/fs/exec.c
@@ -832,23 +832,32 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
 	int err;
-	struct open_flags open_exec_flags = {
-		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
-		.acc_mode = MAY_EXEC,
-		.intent = LOOKUP_OPEN,
-		.lookup_flags = LOOKUP_FOLLOW,
-	};
-
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
-		return ERR_PTR(-EINVAL);
-	if (flags & AT_SYMLINK_NOFOLLOW)
-		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
-	if (flags & AT_EMPTY_PATH)
-		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
 
-	file = do_filp_open(fd, name, &open_exec_flags);
-	if (IS_ERR(file))
-		goto out;
+	if (fd == AT_FDCWD && name->name[0] == '\0' && flags == AT_EMPTY_PATH) {
+		file = get_mm_exe_file(current->mm);
+		if (!file) {
+			file = ERR_PTR(-ENOENT);
+			goto out;
+		}
+	} else {
+		struct open_flags open_exec_flags = {
+			.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+			.acc_mode = MAY_EXEC,
+			.intent = LOOKUP_OPEN,
+			.lookup_flags = LOOKUP_FOLLOW,
+		};
+
+		if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+			return ERR_PTR(-EINVAL);
+		if (flags & AT_SYMLINK_NOFOLLOW)
+			open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
+		if (flags & AT_EMPTY_PATH)
+			open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+
+		file = do_filp_open(fd, name, &open_exec_flags);
+		if (IS_ERR(file))
+			goto out;
+	}
 
 	err = -EACCES;
 	if (!S_ISREG(file_inode(file)->i_mode))
* Re: execve(NULL, argv, envp) for nommu?
  2017-09-06 1:12 ` Rob Landley
  2017-09-08 21:18 ` Rob Landley
@ 2017-09-11 18:14 ` Alan Cox
  1 sibling, 0 replies; 13+ messages in thread
From: Alan Cox @ 2017-09-11 18:14 UTC (permalink / raw)
  To: Rob Landley
  Cc: Geert Uytterhoeven, Linux Embedded, Oleg Nesterov, dalias, linux-kernel@vger.kernel.org

> It's not the performance cost, it's rewriting all the pointers.

Which you don't need to do

> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,

No it doesn't. See Minix.

When you fork() rather than vfork you stick a copy of any non-relocatable
elements (typically DATA copy + BSS + stack with a sane CPU and compiler)
into a buffer and you swap them over with the real copy when you task
switch to the one in the wrong place. If you start the child first you
usually only take one copy.

I've always been amused that Linux NOMMU hasn't managed to grow a feature
that people successfully implemented on 68000 long long ago, and I believe
some other processors back to v6/v7 days.

Alan
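The swap-on-task-switch scheme described above can be modelled in a few lines of userspace C. This is a toy model built only from the description in the message, not Minix code: `live` stands for the one address range where the program's data must sit in order to run, each process keeps a parked `shadow` copy while it is not running, and the names `fork_image` and `switch_to` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define IMG_SIZE 64

/* The one address range where the program's data must live to run. */
static unsigned char live[IMG_SIZE];

struct proc {
	unsigned char shadow[IMG_SIZE];	/* parked copy while not running */
};

static struct proc *running;

/* "fork": the child starts as a byte copy of what is live right now. */
static void fork_image(struct proc *child)
{
	assert(child != NULL);
	memcpy(child->shadow, live, IMG_SIZE);
}

/* Task switch: park the outgoing image, move the incoming one into place. */
static void switch_to(struct proc *next)
{
	assert(next != NULL);
	if (next == running)
		return;
	if (running)
		memcpy(running->shadow, live, IMG_SIZE);
	memcpy(live, next->shadow, IMG_SIZE);
	running = next;
}
```

Since every process always executes with its data at the same address, absolute pointers into `live` stay valid for both parent and child and nothing has to be rewritten; the price is a copy of the whole image on each switch between the two.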