* Re: [uml-devel] new virtualization syscall to improve uml performance?
From: Jeff Dike @ 2002-01-18 23:35 UTC
To: linux-kernel

baccala@freesoft.org said:
> First, kudos to everyone who worked on user mode linux.

Thanks!

> Anyway, I was reading about the design of UML, and it seems to me that
> its performance could be improved by adding a split privilege concept
> to Linux processes. A "normal" process would be "privileged".
> However, to support things like UML, a new syscall could put the
> process into "unprivileged" mode, which would cause any traps or
> faults (like syscalls or SEGVs) to drop the process into "privileged"
> mode at a controlled entry point.

This is an interesting idea. All signals would have to drop you back into
privileged mode, and syscalls would invoke the SIGTRAP handler (I'm not that
fond of this, but it works and it's more or less the way syscall
interception is done now (the process SIGTRAP handler isn't called, but the
tracer is woken up with the child being sent a SIGTRAP)).

I was planning on adding a new slow syscall path (enabled with
PTRACE_SIGSYSCALL or something) which delivers a SIGTRAP to the process and
turns off PTRACE_SIGSYSCALL for the duration of the handler.

Your idea would result in basically the same code, but with a much more
sensible interface to it. Mine would add yet another wart to ptrace, making
it even more toadlike than it is now. The notion of two process privilege
levels is much cleaner and more general.

> Adding an extra bit to the mmap/
> mprotect protection flags could specify memory mappings only
> accessible from privileged mode.

And this knocks off another problem. This would allow UML to unmap kernel
text and data while in unprivileged mode without the huge performance
penalty it has now with mprotecting it by hand.
Though, since processes are normally in privileged mode, I would turn that
flag around and say that in unprivileged mode, only specially marked
mappings are available.

This possibly ties in well with something else I have planned. By adding an
interface to create, manipulate, and destroy address spaces, it will be
possible for one thread to have a pool of address spaces available to it
which it can switch between as needed.

This will allow UML to have one host thread per virtual processor (instead
of one per UML thread, currently) and one address space per UML thread, and
switch the one host process from address space to address space on each
context switch. This would solve a bunch of UML problems in one shot.

Another thing I was trying to figure out how to do cleanly once this is
working is putting the UML kernel in its own address space. This would give
UML processes the full 3G address space they expect and make UML completely
invisible to them.

Of course, the problem is how do you switch address spaces on every signal
and system call. Let's say there's a new system call, unprivilege(), and it
optionally takes an address space handle (which would be a file descriptor).
Then, any switch back to privileged mode would first switch back to that
address space. This seems clean to me.

One thing that's unclear to me is how you enter unprivileged mode in the
first place. I guess you'd specify a procedure in the unprivilege() and it
would be called in unprivileged mode.

This looks like a very good idea to me and it seems like it would cleanly
solve a bunch of problems for UML.

				Jeff
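A minimal sketch of the inverted flag described in this message, written as a user-space model rather than kernel code. It assumes a per-mapping flag (called MAP_UNPRIV here, a name floated later in the thread) and a two-valued process mode; the struct and function names are hypothetical and exist only for illustration. It shows one rule: privileged mode sees every mapping, unprivileged mode sees only the specially marked ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: marks a mapping as visible in unprivileged mode. */
#define MAP_UNPRIV 0x1u

/* The two process privilege levels discussed in the thread. */
enum priv_mode { MODE_PRIVILEGED, MODE_UNPRIVILEGED };

/* Illustrative stand-in for one mapping in an address space. */
struct mapping {
    uintptr_t start;
    size_t length;
    unsigned flags;
};

/* True if addr falls inside [start, start + length). */
static bool mapping_contains(const struct mapping *m, uintptr_t addr)
{
    return addr >= m->start && addr - m->start < m->length;
}

/*
 * Decide whether an access to addr is allowed in the given mode.
 * Privileged mode may touch any mapped address; unprivileged mode
 * may touch only mappings carrying MAP_UNPRIV.
 */
bool address_accessible(const struct mapping *maps, size_t count,
                        enum priv_mode mode, uintptr_t addr)
{
    assert(maps != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!mapping_contains(&maps[i], addr))
            continue;
        if (mode == MODE_PRIVILEGED)
            return true;
        return (maps[i].flags & MAP_UNPRIV) != 0;
    }
    return false; /* unmapped address */
}
```

With this rule, a mapping created without the flag (for example the UML kernel's own text and data) simply disappears from view when the process drops to unprivileged mode, with no per-transition mprotect call.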
* Re: [uml-devel] new virtualization syscall to improve uml performance?
From: Eric W. Biederman @ 2002-01-21 0:28 UTC
To: Jeff Dike; +Cc: linux-kernel

It sounds like there are a couple of good ideas here. Let me add my
refinements.

	new_addr(); /* to get a secondary address space */

	struct sandbox_params {
		int return_reason;
		int return_data;
		int eax;
		int ebx;
	};

	run_sandbox(int address_space, struct sandbox_params *params);
		/* to start a sandbox */

	int fmmap(int address_space, void *start, size_t length, int prot,
		  int flags, int fd, off_t offset);
	int fmunmap(int address_space, void *start, size_t length);

With the secondary address spaces being completely set up by uml. And
run_sandbox being the entry/exit point. The nice thing here is that because
they would share the same kernel stack/process most registers can be left in
registers. With run_sandbox putting as much as possible on a fast path. And
then new_addr, fmmap, fmunmap would be all that you would really need to
manipulate those address spaces.

Usually processors only support a kernel/user space differentiation in their
page tables, and they sometimes support caching multiple address spaces
simultaneously in their tlbs. So I have designed this interface to take
advantage of the common processor features, and additionally look as much
like normal process execution as possible. Any other implementation would
need someone to manually modify the page tables, either the kernel or uml
calling mprotect.

Any trap taken in the sandboxed address space should fill the appropriate
fields in struct sandbox_params and switch address spaces back to the master
process.

This interface is as cheap as I can imagine making it.
And with a little care it can be really optimized on the kernel side if uml
becomes a common case.

Eric
* Re: [uml-devel] new virtualization syscall to improve uml performance?
From: Jeff Dike @ 2002-01-21 2:28 UTC
To: Eric W. Biederman, baccala; +Cc: linux-kernel

I wrote up my thoughts on secondary address spaces on the uml-devel list
(http://www.geocrawler.com/lists/3/SourceForge/709/75/7527174/). I'll try
to be somewhat briefer here.

ebiederm@xmission.com said:
> new_addr(); /* to get a secondary address space */

I asked Linus about this at the kernel summit last year, and he said he
wanted a filesystem interface, not a set of new system calls.

So, my proposed interface (modulo the names, which I welcome improvements to):

Open of:
	/proc/mm - returns a file descriptor referring to a new, empty mm_struct
	/proc/self/mm - returns a file descriptor referring to current->mm
	/proc/<pid>/mm - returns a file descriptor referring to the mm of
		process <pid>

> int fmmap(int address_space, void *start, size_t length, int prot,
> 	int flags, int fd, off_t offset);
> int fmunmap(int address_space, void *start, size_t length);

My proposal for this is to extend mmap:

	void *new_mmap(void *start, size_t length, int prot, int flags,
		       int src_fd, off_t offset, int dest_fd);

The new thing is the addition of dest_fd, which refers to the object within
which the new mapping is to be made. dest_fd == -1 refers to the current
address space. I intend for dest_fd to be an address space descriptor, but
it seems to make sense for it to be anything that supports mmap.

munmap and mprotect would be similarly extended.

> run_sandbox(int address_space, struct sandbox_params *params); /* to
> start a sandbox */
> And run_sandbox being the entry/exit point.
Are you saying that you'd call run_sandbox to switch address spaces and
enter unprivileged mode, and when you re-enter privileged mode, the
run_sandbox call returns in the original address space with a bunch of
information in params?

If so, then

> The nice thing here is
> that because they would share the same kernel stack/process most
> registers can be left in registers.

is wrong. You need to preserve two kernel contexts, so you need two kernel
stacks. The run_sandbox context is obviously one. The unprivileged code
would also need to enter the kernel to fill in the sandbox params and force a
return from run_sandbox. Also, depending on the arch, CPU traps run on the
current kernel stack.

Otherwise, I like this idea.

But, I don't like mixing the process privilege and address space ideas
together like this. UML, like i386, grabs the top of the address space
for itself (it grabs 0xa0000000 - 0xc0000000), so user/kernel space
transitions don't require an address space switch. To require that a
privileged/unprivileged transition also switch address spaces will put a
speed limit on UML system calls.

That's why I like the idea of maps that are only available in privileged
mode. Turning on some mappings seems a lot cheaper than a full address
space switch.

Having said that, I like the address space switch to be available as an
option. There are some UML applications (like honeypots) where having the
UML kernel in a totally different address space would be very useful.

Somewhat unrelated, but another thing I've been thinking about is whether
the process privilege idea could be used to implement strace. One difficulty
is that strace wants to record system calls, not nullify them as UML does.
There doesn't seem to be any room for allowing the system call or signal
handler to proceed in unprivileged mode.

				Jeff
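A minimal sketch of the dest_fd rule from the new_mmap proposal in this message, again as a user-space model and not as real system calls. It assumes that an address space descriptor can be modelled as a slot index in a small table, that slot 0 stands for the task's own mm, and that opening /proc/mm hands back the next free slot. Every name below (model_task, model_open_proc_mm, model_new_mmap, DEST_CURRENT) is hypothetical. The only behaviour modelled is where a mapping lands: in the space named by dest_fd, or in the current space when dest_fd == -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_SPACES 4  /* arbitrary table sizes for the model */
#define MODEL_MAX_MAPS   8
#define DEST_CURRENT   (-1) /* dest_fd == -1: the current address space */

struct model_mapping {
    uintptr_t start;
    size_t length;
    int prot;
};

/* Stand-in for an mm_struct: just a list of recorded mappings. */
struct model_mm {
    int in_use;
    size_t count;
    struct model_mapping maps[MODEL_MAX_MAPS];
};

struct model_task {
    struct model_mm spaces[MODEL_MAX_SPACES]; /* slot 0: the task's own mm */
    int current;                              /* slot the task is running in */
};

void model_task_init(struct model_task *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
    t->spaces[0].in_use = 1;
    t->current = 0;
}

/* Stand-in for open("/proc/mm"): a descriptor for a new, empty space. */
int model_open_proc_mm(struct model_task *t)
{
    for (int fd = 1; fd < MODEL_MAX_SPACES; fd++) {
        if (!t->spaces[fd].in_use) {
            t->spaces[fd].in_use = 1;
            t->spaces[fd].count = 0;
            return fd;
        }
    }
    return -1; /* table full */
}

/* Stand-in for new_mmap(): records the mapping in the chosen space. */
int model_new_mmap(struct model_task *t, uintptr_t start, size_t length,
                   int prot, int dest_fd)
{
    int slot = (dest_fd == DEST_CURRENT) ? t->current : dest_fd;

    if (slot < 0 || slot >= MODEL_MAX_SPACES || !t->spaces[slot].in_use)
        return -1; /* not a valid address space descriptor */

    struct model_mm *mm = &t->spaces[slot];
    if (mm->count == MODEL_MAX_MAPS)
        return -1;
    mm->maps[mm->count++] = (struct model_mapping){ start, length, prot };
    return 0;
}
```

The point of the extra argument is that the caller can populate an address space it is not currently running in, which is what UML would need in order to build one space per UML thread from a single host thread.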
* Re: [uml-devel] new virtualization syscall to improve uml performance?
From: Eric W. Biederman @ 2002-01-21 6:00 UTC
To: Jeff Dike; +Cc: baccala, linux-kernel

Jeff Dike <jdike@karaya.com> writes:

> I wrote up my thoughts on secondary address spaces on the uml-devel list
> (http://www.geocrawler.com/lists/3/SourceForge/709/75/7527174/). I'll try
> to be somewhat briefer here.

O.k. the summary pretty much matches what I was thinking. If I get
deep into it I'll have to read that article.

> ebiederm@xmission.com said:
> > new_addr(); /* to get a secondary address space */
>
> I asked Linus about this at the kernel summit last year, and he said he
> wanted a filesystem interface, not a set of new system calls.

I guess I can see that.

> So, my proposed interface (modulo the names, which I welcome improvements to):
>
> Open of:
> 	/proc/mm - returns a file descriptor referring to a new, empty mm_struct
> 	/proc/self/mm - returns a file descriptor referring to current->mm
> 	/proc/<pid>/mm - returns a file descriptor referring to the mm of
> 		process <pid>
>
> > int fmmap(int address_space, void *start, size_t length, int prot,
> > 	int flags, int fd, off_t offset);
> > int fmunmap(int address_space, void *start, size_t length);
>
> My proposal for this is to extend mmap:
>
> 	void *new_mmap(void *start, size_t length, int prot, int flags,
> 		       int src_fd, off_t offset, int dest_fd);
>
> The new thing is the addition of dest_fd, which refers to the object within
> which the new mapping is to be made. dest_fd == -1 refers to the current
> address space. I intend for dest_fd to be an address space descriptor, but
> it seems to make sense for it to be anything that supports mmap.

Currently that is only address spaces.

> munmap and mprotect would be similarly extended.

Right.
> > run_sandbox(int address_space, struct sandbox_params *params); /* to
> > start a sandbox */
> > And run_sandbox being the entry/exit point.
>
> Are you saying that you'd call run_sandbox to switch address spaces and
> enter unprivileged mode, and when you re-enter privileged mode, the run_sandbox
> call returns in the original address space with a bunch of information in
> params?

That is what I was thinking.

> If so, then
>
> > The nice thing here is
> > that because they would share the same kernel stack/process most
> > registers can be left in registers.
>
> is wrong. You need to preserve two kernel contexts, so you need two kernel
> stacks.

???

The run_sandbox idea is essentially what the current vm86 does except a
little more optimized... I admit I left out a value for the instruction
pointer in the sandbox_params (which would necessarily be address space
dependent). For whatever state run_sandbox would need to return to the
original address space it could simply be stored on the kernel stack. I
really don't see why you would need two kernel contexts.

> The run_sandbox context is obviously one. The unprivileged code
> would also need to enter the kernel to fill in the sandbox params and force a
> return from run_sandbox. Also, depending on the arch, CPU traps run on the
> current kernel stack.

A trap or whatever is part of the return for the run_sandbox context.
The current vm86 system call already does something similar to this.
The 8 general purpose registers are saved and restored but the
floating point registers are passed through.

The tricky addition would be having multiple address spaces per
process.

> Otherwise, I like this idea.

I still don't see why it takes two kernel stacks to pull this off.

> But, I don't like mixing the process privilege and address space ideas
> together like this.
> UML, like i386, grabs the top of the address space
> for itself (it grabs 0xa0000000 - 0xc0000000), so user/kernel space transitions
> don't require an address space switch. To require that a
> privileged/unprivileged transition also switch address spaces will put a
> speed limit on UML system calls.

I will agree that it is sane for the run_sandbox command to work on the
current address space as well.

But I don't like the idea of an implied mprotect. As currently there
isn't any hardware to implement it and I don't see why anyone would
make such hardware I think that part should stay as two calls. The
only reason I can see for an implied mprotect is if executing the
mprotect keeps you from executing the run_sandbox command...

> That's why I like the idea of maps that are only available in privileged mode.
> Turning on some mappings seems a lot cheaper than a full address space switch.

Maybe I'm confused.

The only cost of an address space switch is the tlb flush and reload
cost. (granted that is significant for short code stretches). But
more modern architectures are implementing address space numbers or
their kin so they can keep multiple address spaces in the tlb at once.
With address space numbers an address space switch is practically
free.

The only performance hit I see is with the copy_to_user,
copy_from_user routines.

> Having said that, I like the address space switch to be available as an
> option. There are some UML applications (like honeypots) where having the
> UML kernel in a totally different address space would be very useful.
>
> Somewhat unrelated, but another thing I've been thinking about is whether
> the process privilege idea could be used to implement strace. One difficulty
> is that strace wants to record system calls, not nullify them as UML does.
> There doesn't seem to be any room for allowing the system call or signal
> handler to proceed in unprivileged mode.
Except by totally emulating it, which is fairly invasive, but it is
good for a real sandbox case where we want to make decisions.

I suspect there is some happy compromise case.

Eric
* Re: [uml-devel] new virtualization syscall to improve uml performance?
From: Jeff Dike @ 2002-01-21 20:21 UTC
To: Eric W. Biederman; +Cc: baccala, linux-kernel

ebiederm@xmission.com said:
> For whatever state run_sandbox would need to return to the original
> address space it could simply be stored on the kernel stack. I really
> don't see why you would need two kernel contexts.

Yeah, it didn't occur to me until later that the run_sandbox state could be
stored without occupying a kernel stack.

Unless I'm missing something, the implementation of run_sandbox would be
that it stores the privileged context away somewhere, restores the
unprivileged context (which would then have to be a full register set), and
returns to userspace. So, the privileged context couldn't be on the kernel
stack. It would have to be in (or hanging off) the task structure or
something.

> The tricky addition would be having multiple address spaces per
> process.

It's not multiple address spaces per process so much as it is treating
address spaces as objects completely separate from processes, and processes
can switch between them as they see fit.

So, by opening up some other process's /proc/<pid>/mm and switching to it
(assuming permissions were OK), you could invade that address space. You
would immediately segfault or something because your registers would be all
wrong, but the switch itself would work fine.

> But I don't like the idea of an implied mprotect. As currently there
> isn't any hardware to implement it and I don't see why anyone would
> make such hardware I think that part should stay as two calls.

Hardware isn't the issue. Atomicity is. I want the UML kernel to disappear
when it switches to userspace, just as the native kernel does. In order to
do this, the unprivilege-ing and unmapping of the UML kernel have to happen
in the same system call.
Similarly, the remapping has to happen at the same time as the return from
unprivilege().

That's why I like the idea of having (say) MAP_UNPRIV mappings be the only
ones available in unprivileged mode.

> The only cost of an address space switch is the tlb flush and reload
> cost. (granted that is significant for short code stretches). But
> more modern architectures are implementing address space numbers or
> their kin so they can keep multiple address spaces in the tlb at once.
> With address space numbers an address space switch is practically
> free.

You may be right. I'm not an expert on this. I'd like to keep these two
ideas separate since they seem separately useful, but still have them
interact where it makes sense (i.e. creating an unprivileged context in a
different address space).

> The only performance hit I see is with the copy_to_user,
> copy_from_user routines.

Yeah, but that wouldn't be much of an issue for UML since, if necessary, it
can do a virt_to_phys and grab data from physical memory, which will be in
its address space.

> Except by totally emulating it, which is fairly invasive, but it is
> good for a real sandbox case where we want to make decisions.

Yeah, it's perfect for UML. The reason I'm interested in strace is that if
this can be demonstrated to cleanly replace pieces of ptrace, I think it
will have an automatic fan club.

I don't see why you can't allow an unprivileged context to just continue,
so the system call will proceed or the signal will be delivered, which
would be fine for strace (at least starting the process from scratch, not
sure what to do about attaching to a running process).

To me, this is suggesting an fcntl/ptrace-like interface for performing
various operations on unprivileged contexts:

status = unprivilege(UNPRIV_CREATE, sp, proc, arg, context)
	- would create a new unprivileged context running proc(arg) on the
	stack pointed to by sp, very similar to clone.
	context is a buffer large enough to hold the userspace state of the
	context. This will return immediately with the context filled in.

We have forward compatibility issues with the size of that context buffer
potentially needing to grow as registers are added, and old binaries
overflowing their static small buffers. So, we have a call to ask how big
the buffer should be:

size = unprivilege(UNPRIV_BUFSIZE)

status = unprivilege(UNPRIV_RUN, context)
	- runs the unprivileged context contained in the context buffer.
	Returns some kind of status when the context makes a system call or
	receives a signal. The context buffer also contains information
	about the event that prompted the return. Maybe add some flags to
	indicate that we're only interested in some types of events.

status = unprivilege(UNPRIV_CANCEL, context)
	- cancels the pending event that caused the return to privileged
	context. The system call doesn't happen or the signal isn't
	delivered. Returns as UNPRIV_RUN does.

UML would do something like this:

	size = unprivilege(UNPRIV_BUFSIZE);
	context = kmalloc(size);
	/* proc is a little stub that branches into userspace */
	status = unprivilege(UNPRIV_CREATE, sp, proc, arg, context);
	/* Start it going */
	unprivilege(UNPRIV_RUN, context);
	while(1){
		/* read the system call or signal out of context and either
		 * run the system call in UML or handle the signal */

		/* cancel the syscall or signal */
		unprivilege(UNPRIV_CANCEL, context);
	}

strace would do pretty much the same thing, except it would call UNPRIV_RUN
instead of UNPRIV_CANCEL in the loop.

What do you think about this?

				Jeff
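A minimal sketch of the two control loops described in this message, run against a user-space mock rather than a real system call. The operation names come from the proposal; everything else is an assumption made for illustration: the mock_unprivilege() function, the layout of struct unpriv_context, the fixed event script, and the EV_EXIT event, which is added only so the loops terminate. The mock counts how many pending events were cancelled and how many were allowed to proceed, which is the one difference between the UML loop and the strace loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Operation names from the proposal; the numeric values are arbitrary. */
enum unpriv_op { UNPRIV_BUFSIZE, UNPRIV_CREATE, UNPRIV_RUN, UNPRIV_CANCEL };

/* Events that return control to privileged mode. EV_EXIT is a mock-only
 * addition that ends the scripted run. */
enum unpriv_event { EV_NONE, EV_SYSCALL, EV_SIGNAL, EV_EXIT };

/* Hypothetical context buffer; a real one would hold register state. */
struct unpriv_context {
    size_t next;               /* index of the next scripted event */
    enum unpriv_event pending; /* event that caused the last return */
    int passed_through;        /* events allowed to proceed */
    int cancelled;             /* events nullified */
};

/* What the pretend unprivileged code does, in order. */
static const enum unpriv_event script[] = {
    EV_SYSCALL, EV_SIGNAL, EV_SYSCALL, EV_EXIT
};

/* Run the context until its next event and report that event. */
static long resume(struct unpriv_context *ctx)
{
    ctx->pending = script[ctx->next];
    if (ctx->pending != EV_EXIT)
        ctx->next++;
    return ctx->pending;
}

static int has_pending_event(const struct unpriv_context *ctx)
{
    return ctx->pending == EV_SYSCALL || ctx->pending == EV_SIGNAL;
}

/* User-space mock of the proposed interface. */
long mock_unprivilege(enum unpriv_op op, struct unpriv_context *ctx)
{
    if (op == UNPRIV_BUFSIZE)
        return (long)sizeof(struct unpriv_context);

    assert(ctx != NULL);
    switch (op) {
    case UNPRIV_CREATE:
        memset(ctx, 0, sizeof(*ctx)); /* pending becomes EV_NONE */
        return 0;
    case UNPRIV_RUN:    /* pending event proceeds, then keep running */
        if (has_pending_event(ctx))
            ctx->passed_through++;
        return resume(ctx);
    case UNPRIV_CANCEL: /* pending event is nullified, then keep running */
        if (has_pending_event(ctx))
            ctx->cancelled++;
        return resume(ctx);
    default:
        return -1;
    }
}

/* UML-style loop: handle each event itself, then cancel it. */
int uml_style_loop(struct unpriv_context *ctx)
{
    int handled = 0;
    long ev = mock_unprivilege(UNPRIV_RUN, ctx);
    while (ev != EV_EXIT) {
        handled++; /* UML would emulate the syscall or signal here */
        ev = mock_unprivilege(UNPRIV_CANCEL, ctx);
    }
    return handled;
}

/* strace-style loop: record each event, then let it proceed. */
int strace_style_loop(struct unpriv_context *ctx)
{
    int recorded = 0;
    long ev = mock_unprivilege(UNPRIV_RUN, ctx);
    while (ev != EV_EXIT) {
        recorded++; /* strace would log the event here */
        ev = mock_unprivilege(UNPRIV_RUN, ctx);
    }
    return recorded;
}
```

Both loops see the same three events; only the resume operation differs, so a tracer and a virtualizing kernel could plausibly share one interface.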