* Syscall numbers for BProc @ 2003-04-04 19:32 Erik Hendriks 2003-04-04 19:35 ` Christoph Hellwig 0 siblings, 1 reply; 11+ messages in thread From: Erik Hendriks @ 2003-04-04 19:32 UTC (permalink / raw) To: torvalds; +Cc: linux-kernel Is it possible to get a Linux system call number allocated for BProc? (http://sourceforge.net/projects/bproc) I've been using arbitrary system call numbers for a while but there have been collisions with new kernel features. I'd like to avoid that in the future. BProc currently works on 2.4.x kernels on x86, alpha and ppc (32bit). Thanks, - Erik ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Syscall numbers for BProc 2003-04-04 19:32 Syscall numbers for BProc Erik Hendriks @ 2003-04-04 19:35 ` Christoph Hellwig 2003-04-04 20:43 ` hendriks 2003-04-05 0:44 ` hendriks 0 siblings, 2 replies; 11+ messages in thread From: Christoph Hellwig @ 2003-04-04 19:35 UTC (permalink / raw) To: Erik Hendriks; +Cc: torvalds, linux-kernel On Fri, Apr 04, 2003 at 12:32:18PM -0700, Erik Hendriks wrote: > Is it possible to get a Linux system call number allocated for BProc? > (http://sourceforge.net/projects/bproc) I've been using arbitrary > system call numbers for a while but there have been collisions with > new kernel features. I'd like to avoid that in the future. BProc > currently works on 2.4.x kernels on x86, alpha and ppc (32bit). Please explain why you need syscalls and the exact APIs. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Syscall numbers for BProc 2003-04-04 19:35 ` Christoph Hellwig @ 2003-04-04 20:43 ` hendriks 2003-04-04 21:54 ` H. Peter Anvin 2003-04-05 0:44 ` hendriks 1 sibling, 1 reply; 11+ messages in thread From: hendriks @ 2003-04-04 20:43 UTC (permalink / raw) To: Christoph Hellwig, torvalds, linux-kernel

On Fri, Apr 04, 2003 at 08:35:31PM +0100, Christoph Hellwig wrote:
> On Fri, Apr 04, 2003 at 12:32:18PM -0700, Erik Hendriks wrote:
> > Is it possible to get a Linux system call number allocated for BProc?
> > (http://sourceforge.net/projects/bproc) I've been using arbitrary
> > system call numbers for a while but there have been collisions with
> > new kernel features. I'd like to avoid that in the future. BProc
> > currently works on 2.4.x kernels on x86, alpha and ppc (32bit).
>
> Please explain why you need syscalls and the exact APIs.

BProc provides a bunch of transparent remote management for remote processes in a cluster (e.g. 'ps' shows everything in the cluster, kills work remotely, etc.). There are also process migration mechanisms to place processes on different nodes in the cluster.

The process migration stuff is what I want the system call for. The simplest case is the "bproc_move(int x)" call which a process uses to move itself to another node. In order to migrate a process to another node, BProc needs access to ALL the processor state. Normally, this stuff is saved at system call entry and it appears on the stack for the system call handler. However, on some architectures (e.g. alpha) not all the CPU state is saved there. Some of the registers (the callee-saved ones - I'm a little fuzzy on terminology here) are saved elsewhere (to unpredictable locations on the stack) and I can't get to them anymore if the syscall handler doesn't get called directly. As a result, on alpha, I have a special bit of ASM code as the syscall handler which saves all the registers right away. This system call entry is very similar to the special system call entry for fork on that architecture.

The same problem exists on the receiving end of the process migration - except in that case the process registers need to be written. On most architectures you might be able to get away with the assumption that registers are at offset X from the beginning of the kernel stack. I would regard that as pretty ugly though.

To summarize: the way it's implemented today, the process migration calls depend on stack build-up at syscall entry. Specifically, if BProc is going to replace the contents of a process it needs to be able to modify all the registers in the user process. On some architectures (e.g. alpha) not everything is saved at syscall entry.

- Erik

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Syscall numbers for BProc 2003-04-04 20:43 ` hendriks @ 2003-04-04 21:54 ` H. Peter Anvin 0 siblings, 0 replies; 11+ messages in thread From: H. Peter Anvin @ 2003-04-04 21:54 UTC (permalink / raw) To: linux-kernel Followup to: <20030404204344.GF15620@lanl.gov> By author: hendriks@lanl.gov In newsgroup: linux.dev.kernel > > > > Please explain why you need syscalls and the exact APIs. > [ ... ] OK, you've answered the first question, could you please answer the second one? Thanks, -hpa P.S. I'm in favour of this request, but we still need the answers... -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Syscall numbers for BProc 2003-04-04 19:35 ` Christoph Hellwig 2003-04-04 20:43 ` hendriks @ 2003-04-05 0:44 ` hendriks 2003-04-05 5:45 ` Christoph Hellwig 1 sibling, 1 reply; 11+ messages in thread From: hendriks @ 2003-04-05 0:44 UTC (permalink / raw) To: Christoph Hellwig, torvalds, linux-kernel

On Fri, Apr 04, 2003 at 08:35:31PM +0100, Christoph Hellwig wrote:
> On Fri, Apr 04, 2003 at 12:32:18PM -0700, Erik Hendriks wrote:
> > Is it possible to get a Linux system call number allocated for BProc?
> > (http://sourceforge.net/projects/bproc) I've been using arbitrary
> > system call numbers for a while but there have been collisions with
> > new kernel features. I'd like to avoid that in the future. BProc
> > currently works on 2.4.x kernels on x86, alpha and ppc (32bit).
>
> Please explain why you need syscalls and the exact APIs.

Here's the answer to the second half... The kernel API is reasonably large so let me know if anybody wants more detail anywhere. The C library is basically a 1:1 mapping for these calls.

Also, if anybody wants to know more about BProc in general, they can check out: http://public.lanl.gov/cluster/papers/papers/hendriks-ics02.pdf

API for the syscall:

arg0 is a function number and the meaning of the rest of the arguments depends on that.

For arg0:

0x0001 - BPROC_SYS_VERSION - get BProc version
    arg1 is a pointer to a bproc_version_t which gets filled in.

    return value: 0 on success, -errno on error

0x0002 - BPROC_SYS_DEBUG - undocumented debugging call: a magic debugging hook whose argument meanings are fluid. Currently it does:
    arg1 = 0 - return number of children by checking pptr and opptr.
    arg1 = 1 - return true/false indicating whether wait() will be local.
    arg1 = 2 - return value of nlchild (internal BProc process book keeping val)
    arg1 = 3 - perform process ID sanity check and return information about pptr and opptr in linux and BProc.

0x0003 - BPROC_SYS_MASTER - get master daemon file descriptor. The master daemon reads/writes messages to/from kernel space with this file descriptor.
    no arguments

    return value: a new file descriptor or -errno on failure.

0x0004 - BPROC_SYS_SLAVE - get slave daemon file descriptor. The slave daemon reads/writes messages to/from kernel space with this file descriptor.
    no arguments

    return value: a new file descriptor or -errno on failure.

0x0005 - BPROC_SYS_IOD - get I/O daemon file descriptor. The I/O daemon reads/writes messages to/from kernel space with this file descriptor.
    no arguments

    return value: a new file descriptor or -errno on failure.

0x0006 - BPROC_SYS_NOTIFY - get notifier file descriptor. An app can get a file descriptor which becomes ready for reading whenever a BProc machine state change happens.
    no arguments

    return value: a new file descriptor or -errno on failure.

0x0201 - BPROC_SYS_INFO - get information on node status
    arg1 pointer to an array of bproc_node_info_t's (first element contains index of last node seen, nodes returned will start with next node)
    arg2 number of elements in array

    return value: number of elements returned on success, -errno on error

0x0202 - BPROC_SYS_STATUS - set node status
    arg1 node number
    arg2 new node state

    return value: 0 on success, -errno on error

0x0203 - BPROC_SYS_CHOWN - change node permission bits (perms are file-like)
    arg1 node number
    arg2 new node perm

    return value: 0 on success, -errno on error

0x0207 - BPROC_SYS_CHROOT - ask slave node to chroot()
    arg1 node number
    arg2 pointer to path to chroot() to.

    return value: 0 on success, -errno on error

0x0208 - BPROC_SYS_REBOOT - ask slave node to reboot
    arg1 node number

    return value: 0 on success, -errno on error

0x0209 - BPROC_SYS_HALT - ask slave node to halt
    arg1 node number

    return value: 0 on success, -errno on error

0x020A - BPROC_SYS_PWROFF - ask slave node to power off
    arg1 node number

    return value: 0 on success, -errno on error

0x020B - BPROC_SYS_PINFO - get information about location of remote processes
    arg1 pointer to an array of bproc_proc_info_t's (first element contains index of last proc seen, procs returned will start with next node)
    arg2 number of elements in array

    return value: number of elements returned on success, -errno on error

0x020E - BPROC_SYS_RECONNECT - ask slave daemon to reconnect
    arg1 node number
    arg2 pointer to bproc_connect_t which contains 2 sockaddrs - a local and remote address for the slave to use when re-connecting to the master.

    return value: 0 on success, -errno on error

0x0301 - BPROC_SYS_REXEC - remote exec (replace current process with exec performed on remote node)
    arg1 node number
    arg2 pointer to bproc_move_t (contains exec args, io setup info, etc)

    return value: no return (it's an exec) on success, -errno on error

0x0302 - BPROC_SYS_MOVE - move caller to remote node
    arg1 node number
    arg2 pointer to bproc_move_t (contains flags, io setup info, etc)

    arg2 move flags (how much of the memory space gets sent)

    return value: 0 on success, -errno on error

0x0303 - BPROC_SYS_RFORK - fork a child onto another node. This is a combination of the fork and move calls with semantics such that no child process is ever created (from the parent's point of view) if the move step fails.
    arg1 node number
    arg2 pointer to bproc_move_t (contains flags, io setup info, etc)

    return value: parent: child pid on success, -errno on error
                  child: 0

0x0304 - BPROC_SYS_EXECMOVE - exec and then move. This is a combination of the exec and move syscalls. This call performs an exec and then moves the resulting process image to a remote node before it is allowed to run. This is used to place images of programs which are not BProc aware on remote nodes.
    arg1 node number
    arg2 pointer to bproc_move_t (contains exec args, io setup info, etc.)

    return value: no return on success, -errno on error if error happens in exec(). If error happens during the move step the process will exit with errno as its status.

0x0306 - BPROC_SYS_VRFORK - vector rfork - create many child processes on many remote nodes efficiently.
    arg1 pointer to bproc_move_t. This contains the number of children to create, a list of nodes to move to, an array to store the resulting child process IDs, and possibly IO setup information.

    return value: parent: number of nodes or -errno on error. pid array contains pids or -errno for each child.
                  child: rank in list of nodes (0 .. n-1)

0x0307 - BPROC_SYS_EXEC - use master node to perform exec. A process running on a slave node can ask its "ghost" on the front end to perform an exec for it. The results of that exec will replace the process on the slave node.
    arg1 pointer to bproc_move_t (contains execve args)

    return value: no return on success, -errno on failure.

0x0309 - BPROC_SYS_VEXECMOVE - vector execmove - create many child processes on many remote nodes efficiently. The child process image is the result of the supplied execve.
    arg1 pointer to bproc_move_t. This contains the number of children to create, a list of nodes to move to, an array to store the resulting child process IDs, execve args and possibly IO setup information.

    return value: parent: number of nodes or -errno on failure. The array
                  children: no return. If BPROC_RANK=XXXXXXX exists in the environment, vexecmove will replace the Xs with the child's rank in vexecmove.

0x1000 - at this offset, bproc provides an interface to virtual memory area dumper (vmadump). VMADump is the process save/restore mechanism that BProc uses internally.

0x1000 - VMAD_DO_DUMP - dump the calling process's image to a file descriptor
    arg1 file descriptor
    arg2 dump flags (controls which regions are dumped and which regions are stored as references to files.)

    return value: during dump: number of bytes written to the file descriptor.
                  during undump: 0 (when the process image is restored, it will start by returning from this system call)

0x1001 - VMAD_DO_UNDUMP - restore a process image from a file descriptor. The new image will replace the calling process just like exec.
    arg1 file descriptor

    return value: no return on success, -errno on failure.

    side note: where possible, vmadump adds a binary format for dump files which allows a dump stored in a file to be executed directly.

0x1002 - VMAD_DO_EXECDUMP - this is a combination of the exec and dump system calls. An exec is performed and the resulting process image is dumped to a file descriptor. The process will exit after the dump is complete.
    arg1 pointer to vmadump_execdump_args (contains execve arguments, a file descriptor to dump to and dump flags)

    return value: no return on success, -errno on failure.

VMADump has a "library list" in kernel space. This is the list of files it presumes are available at undump time if you ask it not to dump "libraries".

0x1030 - VMAD_LIB_CLEAR - clear the library list
    no arguments

    return value: 0 on success, -errno on failure.

0x1031 - VMAD_LIB_ADD - add a new library to the list
    arg1 pointer to filename
    arg2 length of filename

    return value: 0 on success, -errno on failure.

0x1032 - VMAD_LIB_DEL - delete a library from the list
    arg1 pointer to filename
    arg2 length of filename

    return value: 0 on success, -errno on failure.

0x1033 - VMAD_LIB_LIST - get the contents of the library list
    arg1 pointer to region to store library list
    arg2 size of region

    The list of libraries is stored as a null-delimited list of strings.

    return value: number of bytes stored on success, -errno on failure.

0x1034 - VMAD_LIB_SIZE - get the size in bytes of the library list
    no arguments

    return value: size in bytes of the library list

^ permalink raw reply [flat|nested] 11+ messages in thread
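[Editor's illustration] The message above describes a single entry point where arg0 selects the operation and the remaining arguments are interpreted per operation. The following is a minimal userspace sketch of that multiplexing pattern in C, not BProc's actual code. Only the function numbers come from the message; `struct demo_version`, the `demo_*` functions, the version values, and the choice of -ENOSYS for unknown function numbers are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Function numbers (arg0) as listed in the message above. */
#define BPROC_SYS_VERSION 0x0001
#define BPROC_SYS_MOVE    0x0302
#define VMAD_DO_DUMP      0x1000  /* start of the vmadump range */

/* Hypothetical stand-in: the real bproc_version_t layout is not given. */
struct demo_version {
    int major;
    int minor;
};

/* Hypothetical handler: fills in a made-up version number. */
static long demo_sys_version(struct demo_version *v)
{
    if (v == NULL)
        return -EFAULT;
    v->major = 1;
    v->minor = 0;
    return 0;
}

/* Hypothetical handler: only validates the node number. */
static long demo_sys_move(long node)
{
    return node < 0 ? -EINVAL : 0;
}

/* One entry point; arg0 selects the operation, as in the thread. */
long demo_bproc_syscall(unsigned long arg0, unsigned long arg1,
                        unsigned long arg2)
{
    (void)arg2; /* unused by the two operations sketched here */

    if (arg0 >= VMAD_DO_DUMP)
        return -ENOSYS; /* vmadump range is not modelled in this sketch */

    switch (arg0) {
    case BPROC_SYS_VERSION:
        return demo_sys_version((struct demo_version *)(uintptr_t)arg1);
    case BPROC_SYS_MOVE:
        return demo_sys_move((long)arg1);
    default:
        return -ENOSYS; /* assumed error for unknown function numbers */
    }
}
```

The sketch shows why a multiplexer keeps the syscall-number footprint small (one number, many operations) and also why it is hard to type-check: every argument travels as an untyped word and is cast inside the dispatcher.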
* Re: Syscall numbers for BProc 2003-04-05 0:44 ` hendriks @ 2003-04-05 5:45 ` Christoph Hellwig 2003-04-05 20:15 ` hendriks 0 siblings, 1 reply; 11+ messages in thread From: Christoph Hellwig @ 2003-04-05 5:45 UTC (permalink / raw) To: hendriks; +Cc: Christoph Hellwig, torvalds, linux-kernel On Fri, Apr 04, 2003 at 05:44:28PM -0700, hendriks@lanl.gov wrote: > Here's the answer to the second half... The kernel API is reasonably > large so let me know if anybody wants more detail anywhere. The C > library is basically a 1:1 mapping for these calls. > > Also, if anybody wants to know more about BProc in general, they can > check out: http://public.lanl.gov/cluster/papers/papers/hendriks-ics02.pdf > > API for the syscall: > > arg0 is a function number and meaning of the rest of the arguments > depends on that. > > For arg0: Stop right here. We don't want even more multiplexer syscalls. Please untangle this into individual calls. > > 0x0001 - BPROC_SYS_VERSION - get BProc version > arg1 is a pointer to a bproc_version_t which gets filled in. > > return value: 0 on success, -errno on error Scratch this one, syscall ABIs are supposed to be stable. > > 0x0002 - BPROC_SYS_DEBUG - undocumented debugging call a magic > debugging hook who's argument meanings are fluid, currently it does: > arg1 = 0 - return number of children by checking pptr and opptr. > arg1 = 1 - return true/false indicating whether wait() will be local. > arg1 = 2 - return value of nlchild (internal BProc process book keeping val) > arg1 = 3 - perform process ID sanity check and return information about > pptr and opptr in linux and BProc. Debug stuff doesn't need a syscall, please get rid of this one. > 0x0003 - BPROC_SYS_MASTER - get master daemon file descriptor. The master > daemon reads/writes messages to/from kernel > space with this file descriptor. > no arguments > > return value: a new file descriptor or -errno on failure. Shouldn't this better be a new character device?
(Ditto for the other fd stuff) > 0x0201 - BPROC_SYS_INFO - get information on node status > arg1 pointer to an array of bproc_node_info_t's (first element contains > index of last node seen, nodes returned will start with next node) > arg2 number of elements in array > > return value: number of elements returned on success, -errno on error This should be read() on a special file. > > 0x0202 - BPROC_SYS_STATUS - set node status > arg1 node number > arg2 new node state > > return value: 0 on success, -errno on error Write on a special file. > > 0x0203 - BPROC_SYS_CHOWN - change node permission bits (perms are file-like) So why is this not a file, e.g. in sysfs? > 0x0207 - BPROC_SYS_CHROOT - ask slave node to chroot() > arg1 node number > arg2 pointer to patch to chroot() to. Please explain this a bit more. Can't you use namespaces properly on the slaves somehow? > 0x0208 - BPROC_SYS_REBOOT - ask slave node to reboot > arg1 node number > > return value: 0 on success, -errno on error > > 0x0209 - BPROC_SYS_HALT - ask slave node to halt > arg1 node number > > return value: 0 on success, -errno on error > > 0x020A - BPROC_SYS_PWROFF - ask slave node to power off > arg1 node number > > return value: 0 on success, -errno on error Can't you just call sys_reboot on the remote node? > > 0x020B - BPROC_SYS_PINFO - get information about location of remote processes > arg1 pointer to an array of bproc_proc_info_t's (first element contains > index of last proc seen, procs returned will start with next node) > arg2 number of elements in array > > return value: number of elements returned on success, -errno on error Should be read() on a special file. > > 0x020E - BPROC_SYS_RECONNECT - ask slave daemon to reconnect > arg1 node number > arg2 pointer to bproc_connect_t which contains 2 sockaddrs - a local > and remote address for the slave to use when re-connecting to > the master. Don't use bproc_connect_t but the real arguments.
> 0x0301 - BPROC_SYS_REXEC - remote exec (replace current process with exec > performed on remote node) > arg1 node number > arg2 pointer to bproc_move_t (contains exec args, io setup info, etc) > > return value: no return (it's an exec) on success, -errno on error > > 0x0302 - BPROC_SYS_MOVE - move caller to remote node > arg1 node number > arg2 pointer to bproc_move_t (contains flags, io setup info, etc) > > arg2 move flags (how much of the memory space gets sent) > > return value: 0 on success, -errno on error > > 0x0303 - BPROC_SYS_RFORK - fork a child onto another node. This is a > combination of the fork and move calls with > semantics such that no child process is > ever created (from the parent's point of > view) if the move step fails. > arg1 node number > arg2 pointer to bproc_move_t (contains flags, io setup info, etc) > > return value: parent: child pid on success, -errno on error > child: 0 > > 0x0304 - BPROC_SYS_EXECMOVE - exec and then move. This is a > combination of the xec and move > syscalls. This call performs an exec > and then moves the resulting process > image to a remote node before it is > allowed to run. This is used to place > images of programs which are not BProc > aware on remote nodes. > arg1 node number > arg2 pointer to bproc_move_t (contains exec args, io setup info, etc.) > > return value: no return on success, -errno on error if error happens > in exec(). If error happens during the move step the process will > exit with errno as its status. > > 0x0306 - BPROC_SYS_VRFORK - vector rfork - create many child processes > on many remote nodes efficiently. > > arg1 pointer to bproc_move_t. This contains the number of children > to create, a list of nodes to move to, an array to store the > resulting child process IDs, and possibly IO setup information. > > return value: parent: number of nodes or -errno on error. > pid array contains pids or -errno for each child. > child: rank in list of nodes (0 .. 
n-1) > > 0x0307 - BPROC_SYS_EXEC - use master node to perform exec. A process > running on a slave node can ask it's "ghost" > on the front end to perform an exec for it. > The results of that exec will replace the > process on the slave node. > arg1 pointer to bproc_move_t (contains execve args) > > return value: no return on success, -errno on failure. > > 0x0309 - BPROC_SYS_VEXECMOVE - vector execmove - create many child > processes on many remote nodes > efficiently. The child process image > is the result of the supplied execve. > > arg1 pointer to bproc_move_t. This contains the number of children > to create, a list of nodes to move to, an array to store the > resulting child process IDs, execve args and possibly IO setup > information. > > return value: parent: number of nodes or -errno on failure. The array > children: no return. If BPROC_RANK=XXXXXXX exists in the > environment, vexecmove will replace the Xs with > the child's rank in vexecmove. > > 0x1000 - at this offset, bproc provides an interface to virtual memory > area dumper (vmadump). VMADump is the process save/restore > mechanism that BProc uses internally. I think all these are pretty generic for any SSI clustering. Could you please talk to the Compaq and Mosix folks about a common API? > 0x1000 - VMAD_DO_DUMP - dump the calling process's image to a file descriptor > arg1 file descriptor > arg2 dump flags (controls which regions are dumped and which regions > are stored as references to files.) > > return value: during dump: number of bytes written to the file descriptor. > during undump: 0 (when the process image is restored, it will > start by returning from this system call) I'm pretty sure this would better be a /proc/<pid>/image file you can read from. > 0x1001 - VMAD_DO_UNDUMP - restore a process image from a file > descriptor. The new image will replace the > calling process just like exec. > arg1 file descriptor > > return value: no return on success, -errno on failure. 
> > side note: where possible, vmadump adds a binary format for dump files > which allows a dump stored in a file to be executed directly. Can't you always use this binary format? And btw, does this checkpoint and restore code depend on the rest of bproc? I'd love to see it even in normal, not cluster-aware kernels. > 0x1030 - VMAD_LIB_CLEAR - clear the library list > no arguments What library lists are all those calls about? Needs more explanation. ^ permalink raw reply [flat|nested] 11+ messages in thread
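[Editor's illustration] On the library list asked about above: the API message earlier in the thread says VMAD_LIB_LIST stores the list "as a null delimited list of strings" and returns the number of bytes stored. A minimal sketch in C of how a caller might walk such a buffer follows. The function name `demo_count_lib_entries` is hypothetical, and whether the last entry carries a trailing NUL is an assumption the thread does not settle, so the sketch accepts both forms.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the non-empty strings in a NUL-delimited buffer of `len` bytes,
 * such as the region the thread says VMAD_LIB_LIST fills in.
 * Hypothetical helper; works whether or not the final entry is
 * NUL-terminated.
 */
size_t demo_count_lib_entries(const char *buf, size_t len)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < len) {
        /* Find the end of the current entry within the remaining bytes. */
        const char *end = memchr(buf + pos, '\0', len - pos);
        size_t n = end ? (size_t)(end - (buf + pos)) : len - pos;

        if (n > 0)
            count++;      /* skip empty entries */
        pos += n + 1;     /* step past the entry and its delimiter */
    }
    return count;
}
```

A caller would pass the buffer and the byte count returned by the list call. The library paths used to exercise the sketch would be made-up examples, not values taken from the thread.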
* Re: Syscall numbers for BProc 2003-04-05 5:45 ` Christoph Hellwig @ 2003-04-05 20:15 ` hendriks 2003-04-08 19:19 ` H. Peter Anvin 2003-04-10 6:29 ` Christoph Hellwig 0 siblings, 2 replies; 11+ messages in thread From: hendriks @ 2003-04-05 20:15 UTC (permalink / raw) To: Christoph Hellwig, torvalds, linux-kernel On Sat, Apr 05, 2003 at 06:45:59AM +0100, Christoph Hellwig wrote: > On Fri, Apr 04, 2003 at 05:44:28PM -0700, hendriks@lanl.gov wrote: > > Here's the answer to the second half... The kernel API is reasonably > > large so let me know if anybody wants more detail anywhere. The C > > library is basically a 1:1 mapping for these calls. > > > > Also, if anybody wants to know more about BProc in general, they can > > check out: http://public.lanl.gov/cluster/papers/papers/hendriks-ics02.pdf > > > > API for the syscall: > > > > arg0 is a function number and meaning of the rest of the arguments > > depends on that. > > > > For arg0: > > Stop right here. We don't want even more muliplexer syscalls. Please > untangle this to individual calls. It is the way it is because, when I'm trying to avoid stomping on other syscalls, having a small footprint is a good thing. BProc will always be a fringe kind of thing. Adding more than a syscall or two seems like quite a bit of pollution in the main kernel to me. Similarly, I don't think the main kernel should include the BProc patch. It changes fairly often, isn't 100% unintrusive and would be used by fewer than 0.1% of people out there. Breaking out every call into a separate syscall number would also make it more difficult to add new features in the future. > > 0x0001 - BPROC_SYS_VERSION - get BProc version > > arg1 is a pointer to a bproc_version_t which gets filled in. > > > > return value: 0 on success, -errno on error > > Scratch this one, syscall ABIs are supposed to be stable.
This version call is here only to make sure that the kernel module and master/slave daemon are properly matched - that is, they're going to be sending/receiving the same messages from each other. The message interface (which is used ONLY by the daemons) is not stable - it can't be. Fixing that in stone would prevent bug fixes. Things can also go badly in weird ways if the daemons and the kernel code don't agree on what the protocol is. The rest of the ABI, all the other calls here, are supposed to be stable. Only the daemons ever look at the version number. > > 0x0002 - BPROC_SYS_DEBUG - undocumented debugging call a magic > > debugging hook who's argument meanings are fluid, currently it does: > > arg1 = 0 - return number of children by checking pptr and opptr. > > arg1 = 1 - return true/false indicating whether wait() will be local. > > arg1 = 2 - return value of nlchild (internal BProc process book keeping val) > > arg1 = 3 - perform process ID sanity check and return information about > > pptr and opptr in linux and BProc. > > Debug stuff doesn't need a syscall, please get rid of this one. It's got to be in some kind of kernel communication layer... All these calls involve answering some kind of question that involves sitting on the task list lock and counting stuff. Little test programs make these calls since they know what the answers *should* be. Whether this one gets built in at all is currently an ifdef. I only included it in the list because it's currently built in by default. > > 0x0003 - BPROC_SYS_MASTER - get master daemon file descriptor. The master > > daemon reads/writes messages to/from kernel > > space with this file descriptor. > > no arguments > > > > return value: a new file descriptor or -errno on failure. > > Shouldn't this better be a new character device? > > (Dito for the other fd stuff) It could be - it was at first. Part of avoiding stepping on too many things included not needing more magic numbers for character device nodes.
Since the syscall was already there, it seemed convenient to get the FDs through that instead of a device node. I considered making it a type of socket but that seemed gross. Instead I looked at the pipe() code and did something similar to that. > > 0x0201 - BPROC_SYS_INFO - get information on node status > > arg1 pointer to an array of bproc_node_info_t's (first element contains > > index of last node seen, nodes returned will start with next node) > > arg2 number of elements in array > > > > return value: number of elements returned on success, -errno on error > > This should be read() on a special file. > > > > > 0x0202 - BPROC_SYS_STATUS - set node status > > arg1 node number > > arg2 new node state > > > > return value: 0 on success, -errno on error > > Write on a special file. > > > > > 0x0203 - BPROC_SYS_CHOWN - change node permission bits (perms are file-like) > > So why is this no file, e.g. in sysfs? Two reasons: 1 - I want this call to work from anywhere in the system. This isn't a complete SSI, there's no shared filesystem. 2 - This information isn't maintained in kernel space right now. The master daemon (a normal user space process) keeps track of it and does permission checks when move requests come by. 2.5 - I haven't ported to 2.5 yet. I've been hearing good things about it so I'll probably look at it soon (i.e. when I get some time). The reason I don't like to mess around with development kernels until they stabilize is that a little instability on 1 machine might be tolerable - on a 1024 node machine, it's really not. > > 0x0207 - BPROC_SYS_CHROOT - ask slave node to chroot() > > arg1 node number > > arg2 pointer to patch to chroot() to. > > Please explain this a bit more. Can't you use namespace properly on > the slaves somehow? There's some legacy here - this stuff pre-dates pivot_root. I should say a little more about the kind of environment the slaves operate in. The slave nodes are very bare-bones (this makes them reliable).
They are diskless, they boot linux + initrd out of flash on the board and the slave node gets running with an essentially blank file system. The less software you install on a node, the more reliable it is. The problem was that we needed something a little more featureful as the root file system. The way around this was to mount whatever we wanted to use as the root fs and then have the slave daemon chroot to it. Processes that migrated to the node after that would see the root file system we wanted. > > 0x0208 - BPROC_SYS_REBOOT - ask slave node to reboot > > arg1 node number > > > > return value: 0 on success, -errno on error > > > > 0x0209 - BPROC_SYS_HALT - ask slave node to halt > > arg1 node number > > > > return value: 0 on success, -errno on error > > > > 0x020A - BPROC_SYS_PWROFF - ask slave node to power off > > arg1 node number > > > > return value: 0 on success, -errno on error > > Can't you just call sys_reboot on the remote node? If you just call reboot on the node w/o getting BProc in the loop, then it wouldn't cleanly disconnect before rebooting. That node would appear hung to the rest of the system until the master decided that it was dead. That's done with a simple timeout right now. Since our nodes are running *nothing* but the BProc slave, you can't log in some other way to kill the slave and then reboot and you can't run shutdown -r or something like that because there are no init scripts. > > 0x020B - BPROC_SYS_PINFO - get information about location of remote processes > > arg1 pointer to an array of bproc_proc_info_t's (first element contains > > index of last proc seen, procs returned will start with next node) > > arg2 number of elements in array > > > > return value: number of elements returned on success, -errno on error > > Should be read() on a special file. It started life like that but then I liked the idea of being able to do it from any node in the system.
(remember no shared fs) > > 0x020E - BPROC_SYS_RECONNECT - ask slave daemon to reconnect > > arg1 node number > > arg2 pointer to bproc_connect_t which contains 2 sockaddrs - a local > > and remote address for the slave to use when re-connecting to > > the master. > > Don't use bproc_connect_t but the real arguments. > > > 0x0301 - BPROC_SYS_REXEC - remote exec (replace current process with exec > > performed on remote node) > > arg1 node number > > arg2 pointer to bproc_move_t (contains exec args, io setup info, etc) > > > > return value: no return (it's an exec) on success, -errno on error > > > > 0x0302 - BPROC_SYS_MOVE - move caller to remote node > > arg1 node number > > arg2 pointer to bproc_move_t (contains flags, io setup info, etc) > > > > arg2 move flags (how much of the memory space gets sent) > > > > return value: 0 on success, -errno on error > > > > 0x0303 - BPROC_SYS_RFORK - fork a child onto another node. This is a > > combination of the fork and move calls with > > semantics such that no child process is > > ever created (from the parent's point of > > view) if the move step fails. > > arg1 node number > > arg2 pointer to bproc_move_t (contains flags, io setup info, etc) > > > > return value: parent: child pid on success, -errno on error > > child: 0 > > > > 0x0304 - BPROC_SYS_EXECMOVE - exec and then move. This is a > > combination of the xec and move > > syscalls. This call performs an exec > > and then moves the resulting process > > image to a remote node before it is > > allowed to run. This is used to place > > images of programs which are not BProc > > aware on remote nodes. > > arg1 node number > > arg2 pointer to bproc_move_t (contains exec args, io setup info, etc.) > > > > return value: no return on success, -errno on error if error happens > > in exec(). If error happens during the move step the process will > > exit with errno as its status. 
> > > > 0x0306 - BPROC_SYS_VRFORK - vector rfork - create many child processes > > on many remote nodes efficiently. > > > > arg1 pointer to bproc_move_t. This contains the number of children > > to create, a list of nodes to move to, an array to store the > > resulting child process IDs, and possibly IO setup information. > > > > return value: parent: number of nodes or -errno on error. > > pid array contains pids or -errno for each child. > > child: rank in list of nodes (0 .. n-1) > > > > 0x0307 - BPROC_SYS_EXEC - use master node to perform exec. A process > > running on a slave node can ask its "ghost" > > on the front end to perform an exec for it. > > The results of that exec will replace the > > process on the slave node. > > arg1 pointer to bproc_move_t (contains execve args) > > > > return value: no return on success, -errno on failure. > > > > 0x0309 - BPROC_SYS_VEXECMOVE - vector execmove - create many child > > processes on many remote nodes > > efficiently. The child process image > > is the result of the supplied execve. > > > > arg1 pointer to bproc_move_t. This contains the number of children > > to create, a list of nodes to move to, an array to store the > > resulting child process IDs, execve args and possibly IO setup > > information. > > > > return value: parent: number of nodes or -errno on failure. The array > > children: no return. If BPROC_RANK=XXXXXXX exists in the > > environment, vexecmove will replace the Xs with > > the child's rank in vexecmove. > > > > 0x1000 - at this offset, bproc provides an interface to virtual memory > > area dumper (vmadump). VMADump is the process save/restore > > mechanism that BProc uses internally. > > I think all these are pretty generic for any SSI clustering. Could > you please talk to the Compaq and Mosix folks about a common API? I think it's not quite as generic as you might think. Compaq (OpenSSI, I presume) and Mosix are mostly concerned with providing a really transparent SSI.
BProc is concerned only with process creation and management - it's a partial SSI. It does nothing for a shared name space. I've always viewed the global FS as a separate problem with the answer being something like Lustre. The difference in goals means that BProc is going to opt for scalability at the cost of maintaining a perfect SSI. Not providing complete transparency (for filesystem and everything else) is a huge win for scalability. At the risk of pissing some people off.... The scalability of completely transparent SSI systems is usually measured in the 10s of nodes. DEC/Compaq/HP claimed to scale to 32 or 64 nodes the last time we talked to them here at the lab. I don't know how many remote processes a single machine can reasonably support in Mosix. We currently have a 1024 node (1 master + 1023 slaves) BProc system ("Pink") here at the lab. Since BProc doesn't provide a complete SSI, the migration API includes some extra stuff to smooth over the rough spots. For example, BProc's API includes some extra stuff for setting up I/O (stdin/out/err) for the remote process. So anyway, while there certainly is overlap, I think OpenSSI and Mosix are probably different enough that coming up with a sensible common API will be difficult. Having that many nodes also makes mass process creation much more important. I think the mass process creation primitives became important for us around 128 nodes. Now we can put a 4MB process on 1023 nodes in 0.7 sec. :) The last time we talked to DEC/Compaq/HP here at the lab, they seemed mostly uninterested in what we were doing anyway. Disclaimer: Any views expressed here are my own and not my employer's. > > 0x1000 - VMAD_DO_DUMP - dump the calling process's image to a file descriptor > > arg1 file descriptor > > arg2 dump flags (controls which regions are dumped and which regions > > are stored as references to files.) > > > > return value: during dump: number of bytes written to the file descriptor.
> > during undump: 0 (when the process image is restored, it will > > start by returning from this system call) > > I'm pretty sure this would better be a /proc/<pid>/image file you > can read from. I'm a little fuzzy on what you mean here. If you're suggesting that a process read from its own /proc/pid/image, then that's hard because the process is changing while you do it. In the 3rd party case (which vmadump doesn't support) it gets more tricky because you need to make sure the process is stopped and the CPU state stored while you're reading this. The reason I like the FD interface is because of how BProc uses it. When a process migrates, BProc opens a tcp socket between the two machines, then it calls dump and undump back-to-back so you can migrate w/o any file system dependencies at all. > > 0x1001 - VMAD_DO_UNDUMP - restore a process image from a file > > descriptor. The new image will replace the > > calling process just like exec. > > arg1 file descriptor > > > > return value: no return on success, -errno on failure. > > > > side note: where possible, vmadump adds a binary format for dump files > > which allows a dump stored in a file to be executed directly. > > Can't you always use this binary format? And btw, does this checkpoint > and restore code depend on the rest of bproc? I'd love to see it even > in normal, not cluster-aware kernels. Unfortunately no (w/o user land magic that is). On alpha, for example, you have the problem that exec doesn't save enough on the way into the syscall so you can't restore everything on the way out. This is the callee-saved register problem again. On x86 and ppc it works fine. You could put some alpha magic in the user library to work around this. I haven't bothered since I never use this feature myself. I always thought of it as a curiosity that wasn't all that useful. I also don't use the binary format when undump'ing in a process migration because that would require writing out the image to a file.
That presumes that I have a writable file system at my disposal (this is often not the case on our clusters) and some place in mind to write it. Then I'd have to clean it up later too. It seems like a needless extra step. Keep in mind that these direct VMADump calls are made available as a convenience mostly for testing. VMADump doesn't depend on BProc at all. You will, however, need a system call for it the way it's written now :) > > 0x1030 - VMAD_LIB_CLEAR - clear the library list > > no arguments > > What library lists are all those calls about? Needs more explanation. If you look at the virtual memory space of a dynamically linked program, the percentage of space used by the program itself (i.e. not libraries) is often very small. In an effort to make process migration really cheap, we're willing to say that files X, Y and Z are available on the machine where we'll be restoring the process image. The candidates for remote caching are, obviously, large shared libraries. So, the dumper needs to know what it can expect to find on the remote system and what it can't. That's where the library list comes in. It probably should just be called the remote file list or something. It's a gross hack where we tell the kernel code what it doesn't need to dump. Anything that isn't dumped gets stored in the dump file as a reference to a file. (e.g. map X bytes of /lib/libc-2.3.2.so @ offset Y) And yeah, this might be cleaner as a writable special file but this was easy given the big syscall mux. - Erik ^ permalink raw reply [flat|nested] 11+ messages in thread
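The library-list idea described in the message above can be pictured as a per-mapping decision: dump the bytes, or store only a file reference. What follows is a minimal sketch under stated assumptions, not VMADump's actual code. The names `Mapping` and `plan_dump` are hypothetical, and the rule that writable mappings are always dumped is an added assumption (they may have been modified in memory); the thread itself only says that anything not dumped is stored as a reference of the form "map X bytes of file @ offset Y".

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Mapping:
    """One region of a process address space (hypothetical model)."""
    start: int            # virtual address where the region begins
    length: int           # size of the region in bytes
    path: Optional[str]   # backing file, or None for anonymous memory
    offset: int           # offset into the backing file
    writable: bool        # assumption: writable regions may be dirty


def plan_dump(mappings: List[Mapping], remote_files: Set[str]) -> List[Tuple]:
    """Decide, region by region, whether to dump bytes or a file reference.

    remote_files plays the role of the "library list": files the caller
    promises will exist on the node where the image is restored.
    """
    plan = []
    for m in mappings:
        can_reference = (
            m.path is not None
            and m.path in remote_files
            and not m.writable      # assumption, see lead-in
        )
        if can_reference:
            # Stored as "map m.length bytes of m.path @ m.offset".
            plan.append(("ref", m.path, m.offset, m.length))
        else:
            # The contents of the region travel with the dump.
            plan.append(("data", m.start, m.length))
    return plan
```

With `/lib/libc-2.3.2.so` on the list, a read-only mapping of that file becomes a short reference, while the program's own text, the library's writable data, and the stack are dumped in full. That is where the savings come from when most of the address space is shared libraries.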
* Re: Syscall numbers for BProc 2003-04-05 20:15 ` hendriks @ 2003-04-08 19:19 ` H. Peter Anvin 2003-04-10 6:29 ` Christoph Hellwig 1 sibling, 0 replies; 11+ messages in thread From: H. Peter Anvin @ 2003-04-08 19:19 UTC (permalink / raw) To: linux-kernel Followup to: <20030405201537.GA18755@lanl.gov> By author: hendriks@lanl.gov In newsgroup: linux.dev.kernel > > The reason it is the way it is because when I'm trying to avoid > stomping on other syscalls, having a small foot print is a good thing. > > BProc will always be a fringe kind of thing. Adding more than a > syscall or two seems like quite a bit of pollution in the main kernel > to me. Similarly, I don't think the main kernel should include the > BProc patch. It changes fairly often, isn't 100% unintrusive and > would be used by less than .1% of people out there. > > Breaking out every call into a separate syscall number would also make > it more difficult to add new features in the future. > Well, first of all, multiplexes break a lot of tools. But worse, they lead to really badly designed APIs partially because of lack of review. You have just demonstrated this phenomenon... -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Syscall numbers for BProc 2003-04-05 20:15 ` hendriks 2003-04-08 19:19 ` H. Peter Anvin @ 2003-04-10 6:29 ` Christoph Hellwig 2003-04-14 17:18 ` hendriks 1 sibling, 1 reply; 11+ messages in thread From: Christoph Hellwig @ 2003-04-10 6:29 UTC (permalink / raw) To: hendriks; +Cc: torvalds, linux-kernel On Sat, Apr 05, 2003 at 01:15:37PM -0700, hendriks@lanl.gov wrote: > The reason it is the way it is because when I'm trying to avoid > stomping on other syscalls, having a small foot print is a good thing. Adding more syscalls isn't really a big deal - whether you add one or a bunch of them in a diff doesn't really matter. > Breaking out every call into a separate syscall number would also make > it more difficult to add new features in the future. Which is a good thing :) Having syscall multiplexers leads to very messy APIs like the one you proposed :) > Since our nodes are running *nothing* but the BProc slave, you can't > log in some other way to kill the slave and then reboot and you can't > run shutdown -r or something like that because there are no init > scripts. We have a reboot notifier call chain in the kernel. > > Should be read() on a special file. > > It started life like that but then I liked the idea of being able to > do it from any node in the system. (remember no shared fs) You have this no shared fs argument a few times - why don't you _add_ a shared virtual filesystem for kernel information? This would clean up many of the messier APIs. > > I'm pretty sure this would better be a /proc/<pid>/image file you > > can read from. > > I'm a little fuzzy on what you mean here. If you're suggesting that a > process read from its own /proc/pid/image, then that's hard because > the process is changing while you do it. In the 3rd party case (which > vmadump doesn't support) it gets more tricky because you need to make > sure the process is stopped and the CPU state stored while you're > reading this. Okay, you're right - this should be a syscall.
> VMADump doesn't depend on BProc at all. You will, however, need a > system call for it the way it's written now :) Yeah, convinced. Care to submit a separated out vmadump patch with the syscalls for 2.5? > > > > 0x1030 - VMAD_LIB_CLEAR - clear the library list > > > no arguments > > > > What library lists are all those calls about? Needs more explanation. > > If you look at the virtual memory space of a dynamically linked > program, the percentage of space used by the program itself (i.e. not > libraries) is often very small. In an effort to make process > migration really cheap, we're willing to say that files X, Y and Z are > available on the machine where we'll be restoring the process image. > The candidates for remote caching are, obviously, large shared > libraries. > > So, the dumper needs to know what it can expect to find on the remote > system and what it can't. That's where the library list comes in. It > probably should just be called the remote file list or something. > It's a gross hack where we tell the kernel code what it doesn't need > to dump. Anything that isn't dumped gets stored in the dump file as a > reference to a file. (e.g. map X bytes of /lib/libc-2.3.2.so @ offset > Y) > > And yeah, this might be cleaner as a writable special file but this > was easy given the big syscall mux. I don't think you really want a device for this. It's more an attribute of the mapping, so a MAP_ALWAYS_LOCAL flag to mmap sounds like the right thing. ^ permalink raw reply [flat|nested] 11+ messages in thread
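The file-descriptor style of dump/undump that this message agrees should stay a syscall can be illustrated in user space. The sketch below is only an analogy under stated assumptions: a JSON dictionary stands in for the process image, the 4-byte length prefix is an invented framing, and `dump`, `undump` and `_read_exact` are hypothetical names. It mirrors two points from the thread: dump returns the number of bytes written, and the image goes straight over a socket without touching any file system.

```python
import json
import socket
import struct


def dump(sock: socket.socket, image: dict) -> int:
    """Write a length-prefixed image to the socket; return bytes written."""
    payload = json.dumps(image).encode("utf-8")
    header = struct.pack("!I", len(payload))   # 4-byte big-endian length
    sock.sendall(header + payload)
    return len(header) + len(payload)


def _read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the stream ends early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("stream closed before the full image arrived")
        buf.extend(chunk)
    return bytes(buf)


def undump(sock: socket.socket) -> dict:
    """Read one length-prefixed image from the socket and rebuild it."""
    (length,) = struct.unpack("!I", _read_exact(sock, 4))
    return json.loads(_read_exact(sock, length).decode("utf-8"))
```

Calling `dump` on one end of `socket.socketpair()` and `undump` on the other restores the same dictionary. The same two calls work unchanged over a TCP connection between machines, which is the property that makes a descriptor-based interface attractive on nodes with no writable file system.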
* Re: Syscall numbers for BProc 2003-04-10 6:29 ` Christoph Hellwig @ 2003-04-14 17:18 ` hendriks 2003-04-14 17:32 ` Richard B. Johnson 0 siblings, 1 reply; 11+ messages in thread From: hendriks @ 2003-04-14 17:18 UTC (permalink / raw) To: Christoph Hellwig, torvalds, linux-kernel On Thu, Apr 10, 2003 at 07:29:10AM +0100, Christoph Hellwig wrote: > On Sat, Apr 05, 2003 at 01:15:37PM -0700, hendriks@lanl.gov wrote: > > The reason it is the way it is because when I'm trying to avoid > > stomping on other syscalls, having a small foot print is a good thing. > > Adding more syscalls isn't really a big deal - whether you add one or > a bunch of them in a diff doesn't really matter. > > > Breaking out every call into a separate syscall number would also make > > it more difficult to add new features in the future. > > Which is a good thing :) Having syscall multiplexers leads to very > messy APIs like the one you proposed :) I think this is a bit of a religious argument. I know it's easy to add syscalls, it's the question of adding a lot of goop. I'm looking for a situation for BProc that's similar to Tux's situation: It's a project which exists outside main kernel development. Its footprint in the mainline kernel is as small as possible - it's got a single system call. That system call *IS* a good size mux which does a number of semi-related things. Given the non-mainstream nature of that feature, I think this is an entirely reasonable state of affairs. When it comes down to it, a system call mux is really no different than an ioctl mux, the only question is whether or not you have a file descriptor in the loop. I don't believe that this leads to messy APIs. I'm not saying that BProc's API isn't messy - it is. Some of it is ok. Having the mux allows the messy parts to exist until they are cleaned up. I do want to clean them up but that's not a trivial task. I like some of the file system ideas you've mentioned.
> > Since our nodes are running *nothing* but the BProc slave, you can't > > log in some other way to kill the slave and then reboot and you can't > > run shutdown -r or something like that because there are no init > > scripts. > > We have a reboot notifier call chain in the kernel. Yep. It works great for kernel stuff. Not so great for user space stuff. > > > Should be read() on a special file. > > > > It started life like that but then I liked the idea of being able to > > do it from any node in the system. (remember no shared fs) > > You have this no shared fs argument a few times - why don't you _add_ > a shared virtual filesystem for kernel information? This would clean > up many of the messier APIs. I really like the filesystem idea for the front end but... Shared filesystems are really a can of worms. This would introduce an n^2 coherency problem that BProc has avoided so far. Still, I like the idea for a file system interface on the master. The back end might be able to do something similar - but slowly. The wrinkle is that there's no information about nodes, etc. currently stored in the kernel on any of the nodes. That's all maintained in user space. > > VMADump doesn't depend on BProc at all. You will, however, need a > > system call for it the way it's written now :) > > Yeah, convinced. Care to submit a separated out vmadump patch with > the syscalls for 2.5? Sure... as soon as I get time to do a port. > > So, the dumper needs to know what it can expect to find on the remote > > system and what it can't. That's where the library list comes in. It > > probably should just be called the remote file list or something. > > It's a gross hack where we tell the kernel code what it doesn't need > > to dump. Anything that isn't dumped gets stored in the dump file as a > > reference to a file. (e.g.
map X bytes of /lib/libc-2.3.2.so @ offset > > Y) > > > > And yeah, this might be cleaner as a writable special file but this > > was easy given the big syscall mux. > > I don't think you really want a device for this. It's more an attribute > of the mapping, so a MAP_ALWAYS_LOCAL flag to mmap sounds like the right > thing. That's really just shifting the responsibility to another hunk of code. Now, the dynamic linker (or any other piece of code that could potentially mmap something like that) would need to know to tag the region. I'm sure there's a cleaner way to handle this than dealing in lists of strings but involving the dynamic linker (another moving target) doesn't seem like the right answer to me. - Erik ^ permalink raw reply [flat|nested] 11+ messages in thread
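The "one system call that is a good size mux" arrangement defended in this message can be sketched as a dispatch table keyed by the opcodes quoted earlier in the thread. This is an illustrative model, not BProc's implementation: `bproc_mux`, `HANDLERS` and the `requests` log are hypothetical names, the handlers are placeholders that only record what was asked, and returning `-ENOSYS` for an unknown opcode is an assumption. Only the opcode values and the "0 on success, -errno on error" convention come from the thread.

```python
import errno

# Opcode values as quoted in the thread.
BPROC_SYS_REBOOT = 0x0208
BPROC_SYS_HALT = 0x0209
BPROC_SYS_PWROFF = 0x020A

# Record of node requests, standing in for messages sent to slave daemons.
requests = []


def _node_request(action):
    """Build a placeholder handler that logs (action, node), returns 0."""
    def handler(node):
        requests.append((action, node))
        return 0
    return handler


# Dispatch table for the multiplexer (hypothetical).
HANDLERS = {
    BPROC_SYS_REBOOT: _node_request("reboot"),
    BPROC_SYS_HALT: _node_request("halt"),
    BPROC_SYS_PWROFF: _node_request("pwroff"),
}


def bproc_mux(op, *args):
    """Single entry point: the first argument selects the operation."""
    handler = HANDLERS.get(op)
    if handler is None:
        return -errno.ENOSYS    # assumption: unknown opcode is "not implemented"
    return handler(*args)
```

From the outside, `bproc_mux(BPROC_SYS_REBOOT, 7)` and `bproc_mux(BPROC_SYS_HALT, 7)` are the same call with a different first argument. That keeps the footprint to one number, as argued above, and it is also the objection raised elsewhere in the thread: a tracer keyed on syscall numbers cannot tell the operations apart without decoding the opcode.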
* Re: Syscall numbers for BProc 2003-04-14 17:18 ` hendriks @ 2003-04-14 17:32 ` Richard B. Johnson 0 siblings, 0 replies; 11+ messages in thread From: Richard B. Johnson @ 2003-04-14 17:32 UTC (permalink / raw) To: hendriks; +Cc: Christoph Hellwig, torvalds, linux-kernel FYI, these syscall numbers are "available", at least for private tests: break() = -1 ENOSYS (Function not implemented) stty(0x1, 0x1) = -1 ENOSYS (Function not implemented) gtty(0x2, 0x2) = -1 ENOSYS (Function not implemented) ftime() = -1 ENOSYS (Function not implemented) prof() = -1 ENOSYS (Function not implemented) acct(0x8048e4a) = -1 ENOSYS (Function not implemented) lock() = -1 ENOSYS (Function not implemented) mpx() = -1 ENOSYS (Function not implemented) Note that nobody can even call break() from the 'C' language. The name is reserved ;^) This is probably a good candidate for the next permanent name change. Cheers, Dick Johnson Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips). Why is the government concerned about the lunatic fringe? Think about it. ^ permalink raw reply [flat|nested] 11+ messages in thread