* Re: light weight user level semaphores
@ 2001-04-19 13:59 George Talbot
0 siblings, 0 replies; 60+ messages in thread
From: George Talbot @ 2001-04-19 13:59 UTC (permalink / raw)
To: linux-kernel; +Cc: torvalds
[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]
On Tue, 17 Apr 2001 12:48:48 -0700 (PDT) Linus Torvalds wrote:
[deletia]
> /*
> * a fast semaphore is a 128-byte opaque thing,
> * aligned on a 128-byte boundary. This is partly
> * to minimize false sharing in the L1 (we assume
> * that 128-byte cache-lines are going to be fairly
> * common), but also to allow the kernel to hide
> * data there
> */
> struct fast_semaphore {
> unsigned int opaque[32];
> } __attribute__((aligned, 64));
>
> struct fast_semaphore *FS_create(char *ID);
> int FS_down(struct fast_semaphore *, unsigned long timeout);
> void FS_up(struct fast_semaphore *);
[deletia]
These are counting semaphores, right? If so, would this make sense?
struct fast_semaphore *FS_create(char *ID, int initial_count);
[FS_down/FS_up the same]
int FS_get_count(struct fast_semaphore *);
The FS_get_count() is less useful, but being able to pass the initial
count to the semaphore is pretty important.
--George
[-- Attachment #2: Card for George Talbot --]
[-- Type: text/x-vcard, Size: 316 bytes --]
begin:vcard
n:Talbot;George
tel;fax:732-615-4526
tel;work:732-615-5099
x-mozilla-html:FALSE
org:Lucent, Inc.;Inter-Networking Systems
adr:;;480 Red Hill Road, Building 1;Middletown;NJ;07748;USA
version:2.1
email;internet:gtalbot@lucent.com
title:Senior Software Engineer
x-mozilla-cpt:;0
fn:George Talbot
end:vcard
^ permalink raw reply [flat|nested] 60+ messages in thread[parent not found: <20010417114433.D1108@w-mikek2.sequent.com>]
* Re: light weight user level semaphores [not found] <20010417114433.D1108@w-mikek2.sequent.com> @ 2001-04-17 19:48 ` Linus Torvalds 2001-04-18 18:13 ` Bernd Eckenfels 2001-04-18 19:35 ` Ulrich Drepper 0 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-17 19:48 UTC (permalink / raw) To: Mike Kravetz; +Cc: Kernel Mailing List [ Cc'd to linux-kernel, to get feedback etc. I've already talked this over with some people a long time ago, but more people might get interested ] On Tue, 17 Apr 2001, Mike Kravetz wrote: > > In the near future, I should have some time to begin > working on a prototype implementation. One thing that > I don't remember too clearly is a reference you made to > the System V semaphore implementation. I'm pretty sure > you indicated any new light weight implementation should > not be based on the System V APIs. Is this correct, or > did I remember incorrectly? It's correct. I don't see any way the kernel can do the SysV semantics for "cleanup" for a semaphore when a process dies in an uncontrolled manner (or do it fast enough even when it can use at_exit() etc). The whole point of fast semaphores would be to avoid the kernel entry entirely for the non-contention case, which basically means that the kernel doesn't even _know_ who holds the semaphore at any given moment. So the kernel cannot do the cleanups on process exit that are part of the SysV semantics. My personal absolute favourite "fast semaphore" implementation is as follows. First the user interface, just to make it clear that the implementation is very far from the interface: /* * a fast semaphore is a 128-byte opaque thing, * aligned on a 128-byte boundary. This is partly * to minimize false sharing in the L1 (we assume * that 128-byte cache-lines are going to be fairly * common), but also to allow the kernel to hide * data there */ struct fast_semaphore { unsigned int opaque[32]; } __attribute__((aligned, 64)); struct fast_semaphore *FS_create(char *ID); int FS_down(struct fast_semaphore *, unsigned long timeout); void FS_up(struct fast_semaphore *); would basically be the interface. People would not need to know what the implementation is like. Add to taste (ie make rw-semaphores, etc), but the above is a kind of "fairly minimal thing". So "trydown()" would just be a FS_down() with a zero timeout, for example. Anyway, the implementation would be roughly: - FS_create is responsible for allocating a shared memory region at "FS_create()" time. This is what the ID is there for: a "anonymous" semaphore would have an ID of NULL, and could only be used by threads or across a fork(): it would basically be done with a MAP_ANON | MAP_SHARED, and the pointer returned would just be a pointer to that memory. So FS_create() starts out by allocating the backing store for the semaphore. This can basically be done in user space, although the kernel does need to get involved for the second part of it, which is to (a) allocate a kernel "backing store" thing that contains the waiters and the wait-queues for other processes and (b) fill in the opaque 128-bit area with the initial count AND the magic to make it fly. More on the magic later. So the second part of FS_create needs a new system call. - FS_down() and FS_up() would be two parts: the fast case (no contention), very similar to what the Linux kernel itself uses. And the slow case (contention), which ends up being a system call. You'd have something like this on x86 in user space: extern void FS_down(struct fast_semahore *fs, unsigned long timeout) __attribute__((regparm(3))); /* Four-instruction fast-path: the call plus these ones */ FS_down: lock ; decl (%edx) js FS_down_contention ret FS_down_contention: movl $FS_down_contention_syscall,%eax int 80 ret (Note: the regparm(3) thing makes the arguments be passed in %edx and %ecx - check me on details in which order, and realize that they will show up as arguments to the system call too because the x86 system call interface is already register-based) FS_up() does the same - see how the kernel already knows to avoid doing the wakup if there has been no contention, and has a fast-path that never goes out-of-line (ie the kernel semaphore out-of-line case is the user-level system call case). So now we get to the "subtle" part. Getting contention right. The above causes us to get to the kernel when we have contention, and the kernel gets only a pointer to user space. In particular, it gets a pointer to memory that it cannot trust, and from that _untrusted_ pointer it needs to quickly get to the _trusted_ part, ie the part that only the kernel itself controls (the stuff with the wait-queues etc). This is where subtlety is needed. The speed concerns are paramount: I am convinced that the non-contention case is the important one, but at the same time we can't allow contention to be _too_ costly either. The system call is fairly cheap (and already acts as a first-level back-off, so that's ok), but we can't afford to spend more time than we need here. So in my opinion the only reasonable approach is to have a kernel pointer in the untrusted memory, and then have ways to quickyl validate the pointer. My preferred approach: - the first word of the "opaque" semaphore is obviously the semaphore count (we already used it that way in the user-space thing). - the second word of the semaphore is the pointer to kernel space that was set up at kernel portion of FS_create. - an arbitrary part (say 256 bits) of the rest of the semaphore are a secure hash that the kernel did at FS_create time. The validation boils down to: unsigned long FS_down_system_call( unsigned long unused, /* %ebx */ unsigned long timeout, /* %ecx */ struct fast_semaphore *fs) /* %edx */ { struct kernel_fast_sem *kfs; if ((unsigned long) fs & 127) goto bad_sem; if (!access_ok(VERIFY_READ, fs, 128)) goto bad_sem; /* * See if the system call already caused * us to become un-contended. We don't need * the kernel pointer for this, and thus * we don't need the verification overhead. */ if (FS_trydown(fs)) return 0; kfs = __get_user(fs->opaque+1); /* * Verify that it might be a valid kernel pointer * before we even try to dereference it */ if ((unsigned long) kfs & 7) goto bad_sem; if (kfs < TASK_SIZE) goto bad_sem; if (kfs > TASK_SIZE+640k && kfs < TASK_SIZE + 1M) goto bad_sem; if (kfs > high_mem) goto bad_sem; /* * Simple first-level check, so that user space * cannot just try to make the signature match * whatever is in kernel memory at the time. There * are some common kernel patterns (like all zero), * which might otherwise allow users to pass in a * bogus kernel pointer. */ if (kfs->magic != FS_SIGNATURE_MAGIC) goto bad_sem; if (kfs->user_address != fs) goto bad_sem; /* * Ok, we know we can dereference it, and that it _looks_ * like a valid semaphore. Make sure by verify secure * signature */ for (i = 0; i < FS_SIGNATURE_WORDS; i++) if (__get_user(fs->opaque+2+i) != kfs->signature[i]) goto bad_sem; /* * Ok, we now have the counter (in user space in "fs") * and the kernel part (wait queues, waiter info etc). * Do the slow path, return success/failure. return do_fs_down(fs, kfs, timeout); bad_sem: /* * EXIT. Don't let the process try billions of bad * combinations fast. Make him fork() for each one. */ do_exit(11); } See? The only important part is that when you create the fast semaphore in FS_create() (and that is going to be the slow part), the signature has to be a cryptographically secure random number so that user space cannot spoof kernel pointers. So the overhead for the above is - non-contention: zero overhead (but semaphore creation is not free) - contention: kernel entry (unavoidable anyway) verification The verification boils down to a few range checks and a (cached - we've already looked at, or will need to look at, the other fields in the same structures) memcmp(), so the overhead there is on the order of 30 cycles. Security issues: - the user could create a non-shared user-mode "fs" pointer that has the right signature, and thus fool the kernel into using the wrong user-mode pointer. Note that this is OK. The kernel won't mess up its own integrity, it will just get the wrong answer. Who cares if the kernel allows multiple users to enter if they are bad users? - The user must _not_ be able to fool the kernel into using a completely non-existing semaphore. Comments? Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-17 19:48 ` Linus Torvalds @ 2001-04-18 18:13 ` Bernd Eckenfels 2001-04-18 19:35 ` Ulrich Drepper 1 sibling, 0 replies; 60+ messages in thread From: Bernd Eckenfels @ 2001-04-18 18:13 UTC (permalink / raw) To: linux-kernel In article <Pine.LNX.4.31.0104171200220.933-100000@penguin.transmeta.com> you wrote: > So FS_create() starts out by allocating the backing store for the > semaphore. This can basically be done in user space, although the > kernel does need to get involved for the second part of it, which > is to (a) allocate a kernel "backing store" thing that contains the > waiters and the wait-queues for other processes and (b) fill in the > opaque 128-bit area with the initial count AND the magic to make it > fly. More on the magic later. > So the second part of FS_create needs a new system call. How will the clean up of the kernelstore work? > - The user must _not_ be able to fool the kernel into using a completely > non-existing semaphore. In that case the access to kernel level is protected by a very secure combination of secure hash and magic number checking. But anyway there is a small chance to get to some kernel memory unauthorized. Do you know if this is the first (known) interface which has a more practical approach to kernel data structure security? If we want to be a bit more strict, we can have a pre-allocated pool of semaphores and the kernel pointer check can add the kernelk address of the semaphore region into account. It's faster than the checksum probably and more secure in protecting the rest of the kernel memory. Spoofing access to other semaphores would be still possible (but can be protected by a smaller hash). Greetings Bernd ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-17 19:48 ` Linus Torvalds 2001-04-18 18:13 ` Bernd Eckenfels @ 2001-04-18 19:35 ` Ulrich Drepper 2001-04-19 8:20 ` Alon Ziv 1 sibling, 1 reply; 60+ messages in thread From: Ulrich Drepper @ 2001-04-18 19:35 UTC (permalink / raw) To: Linus Torvalds; +Cc: Mike Kravetz, Kernel Mailing List Linus Torvalds <torvalds@transmeta.com> writes: Sounds good so far. Some comments. > - FS_create is responsible for allocating a shared memory region > at "FS_create()" time. This is not so great. The POSIX shared semaphores require that an pthread_mutex_t object placed in a shared memory region can be initialized to work across process boundaries. I.e., the FS_create function would actually be FS_init. There is no problem with the kernel or the helper code at user level allocating more storage (for the waitlist of whatever) but it must not be necessary for the user to know about them and place them in share memory themselves. The situation for non-shared (i.e. intra-process) semaphores are easier. What I didn't understand is your remark about fork. The semaphores should be cloned. Unless the shared flag is set there should be no sharing among processes. The rest seems OK. Thanks, -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-18 19:35 ` Ulrich Drepper @ 2001-04-19 8:20 ` Alon Ziv 2001-04-19 8:52 ` Abramo Bagnara ` (3 more replies) 0 siblings, 4 replies; 60+ messages in thread From: Alon Ziv @ 2001-04-19 8:20 UTC (permalink / raw) To: Kernel Mailing List; +Cc: Mike Kravetz, Ulrich Drepper, Linus Torvalds Hmm... I already started (long ago, and abandoned since due to lack of time :-( ) down another path; I'd like to resurrect it... My lightweight-semaphores were actually even simpler in userspace: * the userspace struct was just a signed count and a file handle. * Uncontended case is exactly like Linus' version (i.e., down() is decl + js, up() is incl()). * The contention syscall was (in my implementation) an ioctl on the FH; the FH was a special one, from a private syscall (although with the new VFS I'd have written it as just another specialized FS, or even referred into the SysVsem FS). So, there is no chance for user corruption of kernel data (as it just ain't there...); and the contended-case cost is probably equivalent (VFS cost vs. validation). Hope I inspired someone... -az ----- Original Message ----- From: "Ulrich Drepper" <drepper@redhat.com> To: "Linus Torvalds" <torvalds@transmeta.com> Cc: "Mike Kravetz" <mkravetz@sequent.com>; "Kernel Mailing List" <linux-kernel@vger.kernel.org> Sent: Wednesday, April 18, 2001 21:35 Subject: Re: light weight user level semaphores > Linus Torvalds <torvalds@transmeta.com> writes: > > Sounds good so far. Some comments. > > > - FS_create is responsible for allocating a shared memory region > > at "FS_create()" time. > > This is not so great. The POSIX shared semaphores require that an > pthread_mutex_t object placed in a shared memory region can be > initialized to work across process boundaries. I.e., the FS_create > function would actually be FS_init. There is no problem with the > kernel or the helper code at user level allocating more storage (for > the waitlist of whatever) but it must not be necessary for the user to > know about them and place them in share memory themselves. > > The situation for non-shared (i.e. intra-process) semaphores are > easier. What I didn't understand is your remark about fork. The > semaphores should be cloned. Unless the shared flag is set there > should be no sharing among processes. > > > The rest seems OK. Thanks, > > -- > ---------------. ,-. 1325 Chesapeake Terrace > Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA > Red Hat `--' drepper at redhat.com `------------------------ > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 8:20 ` Alon Ziv @ 2001-04-19 8:52 ` Abramo Bagnara 2001-04-19 9:08 ` Alexander Viro 2001-04-19 16:11 ` Linus Torvalds 2001-04-19 9:08 ` Ingo Oeser ` (2 subsequent siblings) 3 siblings, 2 replies; 60+ messages in thread From: Abramo Bagnara @ 2001-04-19 8:52 UTC (permalink / raw) To: Alon Ziv Cc: Kernel Mailing List, Mike Kravetz, Ulrich Drepper, Linus Torvalds Alon Ziv wrote: > > Hmm... > I already started (long ago, and abandoned since due to lack of time :-( ) > down another path; I'd like to resurrect it... > > My lightweight-semaphores were actually even simpler in userspace: > * the userspace struct was just a signed count and a file handle. > * Uncontended case is exactly like Linus' version (i.e., down() is decl + > js, up() is incl()). > * The contention syscall was (in my implementation) an ioctl on the FH; the > FH was a special one, from a private syscall (although with the new VFS I'd > have written it as just another specialized FS, or even referred into the > SysVsem FS). > > So, there is no chance for user corruption of kernel data (as it just ain't > there...); and the contended-case cost is probably equivalent (VFS cost vs. > validation). This would also permit: - to have poll() - to use mmap() to obtain the userspace area It would become something very near to sacred Unix dogmas ;-) -- Abramo Bagnara mailto:abramo@alsa-project.org Opera Unica Phone: +39.546.656023 Via Emilia Interna, 140 48014 Castel Bolognese (RA) - Italy ALSA project http://www.alsa-project.org It sounds good! ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 8:52 ` Abramo Bagnara @ 2001-04-19 9:08 ` Alexander Viro 2001-04-19 10:44 ` Abramo Bagnara 2001-04-19 16:11 ` Linus Torvalds 1 sibling, 1 reply; 60+ messages in thread From: Alexander Viro @ 2001-04-19 9:08 UTC (permalink / raw) To: Abramo Bagnara Cc: Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper, Linus Torvalds On Thu, 19 Apr 2001, Abramo Bagnara wrote: > Alon Ziv wrote: > > > > Hmm... > > I already started (long ago, and abandoned since due to lack of time :-( ) > > down another path; I'd like to resurrect it... > > > > My lightweight-semaphores were actually even simpler in userspace: > > * the userspace struct was just a signed count and a file handle. > > * Uncontended case is exactly like Linus' version (i.e., down() is decl + > > js, up() is incl()). > > * The contention syscall was (in my implementation) an ioctl on the FH; the > > FH was a special one, from a private syscall (although with the new VFS I'd > > have written it as just another specialized FS, or even referred into the > > SysVsem FS). > > > > So, there is no chance for user corruption of kernel data (as it just ain't > > there...); and the contended-case cost is probably equivalent (VFS cost vs. > > validation). > > This would also permit: > - to have poll() > - to use mmap() to obtain the userspace area > > It would become something very near to sacred Unix dogmas ;-) I suspect that simple pipe with would be sufficient to handle contention case - nothing fancy needed (read when you need to block, write upon up() when you have contenders) Would something along the lines of (inline as needed, etc.) down: lock decl count js __down_failed down_done: ret up: lock incl count jle __up_waking up_done: ret __down_failed: call down_failed jmp down_done __up_waking: call up_waking jmp up_done down_failed() { read(pipe_fd, &dummy, 1); } up_waking() { write(pipe_fd, &dummy, 1); } be enough? Al ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 9:08 ` Alexander Viro @ 2001-04-19 10:44 ` Abramo Bagnara 0 siblings, 0 replies; 60+ messages in thread From: Abramo Bagnara @ 2001-04-19 10:44 UTC (permalink / raw) To: Alexander Viro Cc: Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper, Linus Torvalds Alexander Viro wrote: > > I suspect that simple pipe with would be sufficient to handle contention > case - nothing fancy needed (read when you need to block, write upon up() > when you have contenders) > > Would something along the lines of (inline as needed, etc.) > > down: > lock decl count > js __down_failed > down_done: > ret > > up: > lock incl count > jle __up_waking > up_done: > ret > > __down_failed: > call down_failed > jmp down_done > __up_waking: > call up_waking > jmp up_done > > down_failed() > { > read(pipe_fd, &dummy, 1); > } > > up_waking() > { > write(pipe_fd, &dummy, 1); > } > > be enough? There is something wonderful in this simple solution. However I've a few doubts: - choice policy for thread to wake is not selectable - we separate shared memory area from file descriptor - the implementation of down_try has neither been discussed nor excluded, but I don't see how to implement it The implementation of a specific filesystem seems to me more flexyble. -- Abramo Bagnara mailto:abramo@alsa-project.org Opera Unica Phone: +39.546.656023 Via Emilia Interna, 140 48014 Castel Bolognese (RA) - Italy ALSA project http://www.alsa-project.org It sounds good! ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 8:52 ` Abramo Bagnara 2001-04-19 9:08 ` Alexander Viro @ 2001-04-19 16:11 ` Linus Torvalds 2001-04-19 16:33 ` Alexander Viro ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 16:11 UTC (permalink / raw) To: Abramo Bagnara Cc: Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Abramo Bagnara wrote: > > > [ Using file descriptors ] > > This would also permit: > - to have poll() > - to use mmap() to obtain the userspace area > > It would become something very near to sacred Unix dogmas ;-) No, this is NOT what the UNIX dogmas are all about. When UNIX says "everything is a file", it really means that "everything is a stream of bytes". Things like magic operations on file desciptors are _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic semantics of file descriptors is NOT Unix dogma at all, it is a horrible corruption of the original UNIX cleanlyness. Please don't excuse "semaphore file descriptors" with the "everything is a file" mantra. It is not at ALL applicable. The "everything is a file" mantra is to make pipe etc meaningful - processes don't have to worry about whether the fd they have is from a file open, a pipe() system call, opening a special block device, or a socket()+connect() thing. They can just read and write. THAT is what UNIX is all about. And this is obviously NOT true of a "magic file descriptors for semaphores". You can't pass it off as stdin to another process and expect anything useful from it unless the other process _knows_ it is a special semaphore thing and does mmap magic or something. The greatness of UNIX comes from "everything is a stream of bytes". That's something that almost nobody got right before UNIX. Remember VMS structured files? Did anybody ever realize what an absolutely _idiotic_ crock the NT "CopyFile()" thing is for the same reason? Don't confuse that with "everything should be a file descriptor". The two have nothing to do with each other. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:11 ` Linus Torvalds @ 2001-04-19 16:33 ` Alexander Viro 2001-04-19 16:43 ` Linus Torvalds 2001-04-19 16:43 ` Abramo Bagnara 2001-04-19 20:47 ` Ingo Oeser 2 siblings, 1 reply; 60+ messages in thread From: Alexander Viro @ 2001-04-19 16:33 UTC (permalink / raw) To: Linus Torvalds Cc: Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Linus Torvalds wrote: > > > On Thu, 19 Apr 2001, Abramo Bagnara wrote: > > > > > [ Using file descriptors ] > > > > This would also permit: > > - to have poll() > > - to use mmap() to obtain the userspace area > > > > It would become something very near to sacred Unix dogmas ;-) > > No, this is NOT what the UNIX dogmas are all about. > > When UNIX says "everything is a file", it really means that "everything is > a stream of bytes". Things like magic operations on file desciptors are > _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic > semantics of file descriptors is NOT Unix dogma at all, it is a horrible > corruption of the original UNIX cleanlyness. <applause> The only reason for using file descriptors is that we can (AFAICS) avoid any magic operations or new kinds of files. Honest-to-$DEITY read() and write() on real pipes seems to be enough to implement contention case for simple semaphores. I see your point re sequential allocation of descriptors, but I'm not sure that it's that serious - we need that stuff only for multi-threaded programs and in that case we can't rely on sequentially allocated descriptors anyway - stuff from different threads gets mixed together. I certainly agree that introducing ioctl() in _any_ API is a shootable offense. However, I wonder whether we really need any kernel changes at all. Al ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:33 ` Alexander Viro @ 2001-04-19 16:43 ` Linus Torvalds 2001-04-19 17:33 ` Alexander Viro 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 16:43 UTC (permalink / raw) To: Alexander Viro Cc: Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Alexander Viro wrote: > > I certainly agree that introducing ioctl() in _any_ API is a shootable > offense. However, I wonder whether we really need any kernel changes > at all. I'd certainly be interested in seeing the pipe-based approach. Especially if you make the pipe allocation lazy. That isn'tr trivial (it needs to be done right with both up_failed() and down_failed() trying to allocate the pipe on contention and using an atomic cmpxchg-style setting if none existed before). It has the BIG advantage of working on old kernels, so that you don't need to have backwards compatibility cruft in the libraries. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:43 ` Linus Torvalds @ 2001-04-19 17:33 ` Alexander Viro 2001-04-19 17:38 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Alexander Viro @ 2001-04-19 17:33 UTC (permalink / raw) To: Linus Torvalds Cc: Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Linus Torvalds wrote: > > > On Thu, 19 Apr 2001, Alexander Viro wrote: > > > > I certainly agree that introducing ioctl() in _any_ API is a shootable > > offense. However, I wonder whether we really need any kernel changes > > at all. > > I'd certainly be interested in seeing the pipe-based approach. Especially > if you make the pipe allocation lazy. That isn'tr trivial (it needs to be > done right with both up_failed() and down_failed() trying to allocate the > pipe on contention and using an atomic cmpxchg-style setting if none > existed before). It has the BIG advantage of working on old kernels, so > that you don't need to have backwards compatibility cruft in the > libraries. Ehh... Non-lazy variant is just read() and write() as down_failed() and up_wakeup() Lazy... How about if (Lock <= 1) goto must_open; opened: /* as in non-lazy case */ must_open: pipe(fd); lock decl Lock jg lost_it /* Already seriously positive - clean up and go */ jl spin_and_lose /* Lock went from 1 to 0 - go ahead */ reader = fd[0]; writer = fd[1]; Lock = MAX_INT; goto opened; spin_and_lose: /* Won't take long - another guy got to do 3 memory writes */ while (Lock <= 0) ; lost_it: lock incl Lock close(fd[0]); close(fd[1]); goto opened; Al ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 17:33 ` Alexander Viro @ 2001-04-19 17:38 ` Linus Torvalds 2001-04-19 18:24 ` Alexander Viro 2001-04-19 19:26 ` Ulrich Drepper 0 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 17:38 UTC (permalink / raw) To: Alexander Viro Cc: Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Alexander Viro wrote: > > Ehh... Non-lazy variant is just read() and write() as down_failed() and > up_wakeup() Lazy... How about Looks good to me. Anybody want to try this out and test some benchmarks? There may be problems with large numbers of semaphores, but hopefully that won't be an issue. And the ability to select/poll on these things might come in handy for various implementation issues (ie locks with timeouts etc). Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 17:38 ` Linus Torvalds @ 2001-04-19 18:24 ` Alexander Viro 2001-04-19 19:26 ` Ulrich Drepper 1 sibling, 0 replies; 60+ messages in thread From: Alexander Viro @ 2001-04-19 18:24 UTC (permalink / raw) To: Linus Torvalds Cc: Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Linus Torvalds wrote: > > > On Thu, 19 Apr 2001, Alexander Viro wrote: > > > > Ehh... Non-lazy variant is just read() and write() as down_failed() and > > up_wakeup() Lazy... How about > > Looks good to me. Anybody want to try this out and test some benchmarks? Ugh. It doesn't look good for me. s/MAX_INT/MAX_INT>>1/ or we will get into trouble on anything that goes into spin_and_lose. Window is pretty narrow (notice that lost_it is OK - we only need to worry about somebody coming in after winner drives Lock from 1 to 0 and before it gets it from 0 to MAX_INT), but we can get into serious trouble if schedule() will hit that window. MAX_INT/2 should be enough to deal with that, AFAICS. However, I would _really_ like to get that code reviewed from the memory access ordering POV. Warning: right now I'm half-asleep, so the thing can very well be completely bogus in that area. Extra eyes would be certainly welcome. Al PS: ->Lock should be set to 1 when we initialize semaphore. Destroying semaphore should do if (sem->Lock > 1) { close(sem->writer); close(sem->reader); } ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 17:38 ` Linus Torvalds 2001-04-19 18:24 ` Alexander Viro @ 2001-04-19 19:26 ` Ulrich Drepper 2001-04-19 19:35 ` Alan Cox ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 19:26 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz Linus Torvalds <torvalds@transmeta.com> writes: > Looks good to me. Anybody want to try this out and test some benchmarks? I fail to see how this works across processes. How can you generate a file descriptor for this pipe in a second process which simply shares some memory with the first one? The first process is passive: no file descriptor passing must be necessary. How these things are working elsewhere is that a memory address (probably a physical address) is used as a token. The semaphore object is placed in the memory shared by the processes and the virtual address is passed in the syscall. Note that semaphores need not always be shared between processes. This is a property the user has to choose. So the implementation can be easier in the normal intra-process case. In any case all kinds of user-level operations are possible as well and all the schemes suggested for dealing with the common case without syscalls can be applied here as well. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 19:26 ` Ulrich Drepper @ 2001-04-19 19:35 ` Alan Cox 2001-04-19 20:06 ` Ulrich Drepper 2001-04-19 20:22 ` Ingo Oeser 2001-04-19 20:49 ` Linus Torvalds 2 siblings, 1 reply; 60+ messages in thread From: Alan Cox @ 2001-04-19 19:35 UTC (permalink / raw) To: drepper Cc: Linus Torvalds, Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz > I fail to see how this works across processes. How can you generate a > file descriptor for this pipe in a second process which simply shares > some memory with the first one? The first process is passive: no file > descriptor passing must be necessary. mknod foo p. Or use sockets (although AF_UNIX sockets are higher latency) Thats why I suggested using flock - its name based. Whether you mkstemp() stuff and pass it around isnt something I care about Files give you permissions for free too > Note that semaphores need not always be shared between processes. > This is a property the user has to choose. So the implementation can > be easier in the normal intra-process case. So you have unix file permissions on them ? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 19:35 ` Alan Cox @ 2001-04-19 20:06 ` Ulrich Drepper 2001-04-19 20:11 ` Alan Cox 0 siblings, 1 reply; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 20:06 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz Alan Cox <alan@lxorguk.ukuu.org.uk> writes: > mknod foo p. Or use sockets (although AF_UNIX sockets are higher latency) > Thats why I suggested using flock - its name based. Whether you mkstemp() > stuff and pass it around isnt something I care about > > Files give you permissions for free too I don't want nor need file permissions. A program would look like this: process 1: fd = open("somefile") addr = mmap(fd); pthread_mutexattr_init(&attr); pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); pthread_mutex_init ((pthread_mutex_t *) addr, &attr); pthread_mutex_lock ((pthread_mutex_t *) addr); pthread_mutex_destroy((pthread_mutex_t *) addr); process 2: fd = open("somefile") addr = mmap(fd); pthread_mutex_lock ((pthread_mutex_t *) addr); The shared mem segment can be retrieved in whatever way. The mutex in this case is anonymous. Everybody who has access to the shared mem *must* have access to the mutex. For semaphores it looks similarly. First the anonymous case: process 1: fd = open("somefile") addr = mmap(fd); sem_init ((sem_t *) addr, 1, 10); // 10 is arbitrary sem_wait ((sem_t *) addr); sem_destroy((sem_t *) addr); process 2: fd = open("somefile") addr = mmap(fd); sem_wait ((sem_t *) addr); Note that POSIX semaphores could be implemented with global POSIX mutexes. Finally, named semaphores: semp = sem_open("somefile", O_CREAT|O_EXCL, 0600) sem_wait (semp); sem_close(semp); sem_unlink(semp); This is the only semaphore kind which maps nicely to a pipe or socket. All the others don't. And even for named semaphores it is best to have a separate name space like the shmfs. > So you have unix file permissions on them ? See above. Permissions are only allowed for named semaphores. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:06 ` Ulrich Drepper @ 2001-04-19 20:11 ` Alan Cox 2001-04-19 20:26 ` Ulrich Drepper 0 siblings, 1 reply; 60+ messages in thread From: Alan Cox @ 2001-04-19 20:11 UTC (permalink / raw) To: drepper Cc: Alan Cox, Linus Torvalds, Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz > I don't want nor need file permissions. A program would look like this: Your example opens/mmaps so has file permissions. Which is what I was asking > The shared mem segment can be retrieved in whatever way. The mutex in > this case is anonymous. Everybody who has access to the shared mem > *must* have access to the mutex. We agree 8) ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:11 ` Alan Cox @ 2001-04-19 20:26 ` Ulrich Drepper 0 siblings, 0 replies; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 20:26 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz Alan Cox <alan@lxorguk.ukuu.org.uk> writes: > > I don't want nor need file permissions. A program would look like this: > > Your example opens/mmaps so has file permissions. Which is what I was asking There are no permissions on the mutex object. It is the shared memory which counts. If you would implement the global mutexes as independent objects in the filesystem hierarchy you would somehow magically make the permissions match those of the object containing the memory representation of the global semaphore. fd = open("somefile", O_CREAT|O_TRUNC, 0666) addr=mmap(fd) // assume attr is for a global mutex pthread_mutex_init((pthread_mutex_t*)addr, &attr) fchmod(fd, 0600) fchown(fd, someuser, somegroup) If pthread_mutex_attr() is allocating some kind of file, how do you determine the permissions? How are they changed if the permissions to the file change? The kernel representation of the mutex must not be disassociated from the shared memory region. Even if you all think very little about Solaris, look at the kernel interface for semaphores. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 19:26 ` Ulrich Drepper 2001-04-19 19:35 ` Alan Cox @ 2001-04-19 20:22 ` Ingo Oeser 2001-04-19 20:40 ` Ulrich Drepper ` (2 more replies) 2001-04-19 20:49 ` Linus Torvalds 2 siblings, 3 replies; 60+ messages in thread From: Ingo Oeser @ 2001-04-19 20:22 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Apr 19, 2001 at 12:26:03PM -0700, Ulrich Drepper wrote: > In any case all kinds of user-level operations are possible as well > and all the schemes suggested for dealing with the common case without > syscalls can be applied here as well. Are you sure, you can implement SMP-safe, atomic operations (which you need for all up()/down() in user space) WITHOUT using privileged instructions on ALL archs Linux supports? How do we do this on nccNUMA machines later? How on clusters[1]? On what I can see in asm-*/atomic.h this is not possible, but I probably miss sth. here ;-) I didn't know that POSIX forbids using fds to implement a semaphore. That's VERY bad. Learning new APIs always means making a lot of mistakes and doing this while we write production code, since nobody likes to pay for experiments. And I still see no point on speeding of creation and contention, since these should be rare cases and the application overusing these should be punished HARD. Maybe someone can enlighten my on these aspects. Regards Ingo Oeser [1] Ok, people already use other than Unix mechanisms for this stuff on massive parallel computing. So this might not be an issue. Only for libc internal sema4s -- 10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag> <<<<<<<<<<<< been there and had much fun >>>>>>>>>>>> ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:22 ` Ingo Oeser @ 2001-04-19 20:40 ` Ulrich Drepper 2001-04-19 20:51 ` Linus Torvalds 2001-04-19 21:38 ` Alan Cox 2 siblings, 0 replies; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 20:40 UTC (permalink / raw) To: Ingo Oeser; +Cc: Linus Torvalds, Kernel Mailing List Ingo Oeser <ingo.oeser@informatik.tu-chemnitz.de> writes: > Are you sure, you can implement SMP-safe, atomic operations (which you need > for all up()/down() in user space) WITHOUT using privileged > instructions on ALL archs Linux supports? Which processors have no such instructions but are SMP-capable? > How do we do this on nccNUMA machines later? How on clusters[1]? Clusters are not my problem. They require additional software. And NUMA machines maybe be requiring a certain sequence in which the operations must be performed and the hardware should take care of the rest. I don't really care what the final implementation will be like. For UP and SMP machines I definitely want to have as much as possible at user-level. If you need a special libpthread for NUMA machines, so be it. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:22 ` Ingo Oeser 2001-04-19 20:40 ` Ulrich Drepper @ 2001-04-19 20:51 ` Linus Torvalds 2001-04-19 21:38 ` Alan Cox 2 siblings, 0 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 20:51 UTC (permalink / raw) To: Ingo Oeser; +Cc: Ulrich Drepper, Kernel Mailing List On Thu, 19 Apr 2001, Ingo Oeser wrote: > > Are you sure, you can implement SMP-safe, atomic operations (which you need > for all up()/down() in user space) WITHOUT using privileged > instructions on ALL archs Linux supports? Why do you care? Sure, there are broken architectures out there. They'd need system calls. They'd be slow. That's THEIR problem. No sane architecture has this limitation. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:22 ` Ingo Oeser 2001-04-19 20:40 ` Ulrich Drepper 2001-04-19 20:51 ` Linus Torvalds @ 2001-04-19 21:38 ` Alan Cox 2 siblings, 0 replies; 60+ messages in thread From: Alan Cox @ 2001-04-19 21:38 UTC (permalink / raw) To: Ingo Oeser; +Cc: Ulrich Drepper, Linus Torvalds, Kernel Mailing List > Are you sure, you can implement SMP-safe, atomic operations (which you need > for all up()/down() in user space) WITHOUT using privileged > instructions on ALL archs Linux supports? You don't need to. For some architectures the semaphore code would always call into the kernel. For those that allow fast locks in userspace it won't. The API is the thing, and the public exposure would I assume be pthreads ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 19:26 ` Ulrich Drepper 2001-04-19 19:35 ` Alan Cox 2001-04-19 20:22 ` Ingo Oeser @ 2001-04-19 20:49 ` Linus Torvalds 2001-04-19 21:18 ` Ulrich Drepper 2 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 20:49 UTC (permalink / raw) To: Ulrich Drepper Cc: Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz On 19 Apr 2001, Ulrich Drepper wrote: > Linus Torvalds <torvalds@transmeta.com> writes: > > > Looks good to me. Anybody want to try this out and test some benchmarks? > > I fail to see how this works across processes. It's up to FS_create() to create whatever shared mapping is needed. For threads, you don't need anything special. For fork()'d helper stuff, you'd use MAP_ANON | MAP_SHARED. For execve(), you need shm shared memory or MAP_SHARED on a file. It all depends on your needs. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:49 ` Linus Torvalds @ 2001-04-19 21:18 ` Ulrich Drepper 2001-04-19 21:41 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 21:18 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz Linus Torvalds <torvalds@transmeta.com> writes: > > I fail to see how this works across processes. > > It's up to FS_create() to create whatever shared mapping is needed. No, the point is that FS_create is *not* the one creating the shared mapping. The user is explicitly doing this her/himself. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 21:18 ` Ulrich Drepper @ 2001-04-19 21:41 ` Linus Torvalds 2001-04-19 22:46 ` Ulrich Drepper 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 21:41 UTC (permalink / raw) To: Ulrich Drepper Cc: Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz On 19 Apr 2001, Ulrich Drepper wrote: > Linus Torvalds <torvalds@transmeta.com> writes: > > > > I fail to see how this works across processes. > > > > It's up to FS_create() to create whatever shared mapping is needed. > > No, the point is that FS_create is *not* the one creating the shared > mapping. The user is explicitly doing this her/himself. No. Who creates the shared mapping is _irrelevant_, because it ends up being entirely a function of what the chosen interface is. For example, quote often you want semaphores for threading purposes only, and then you don't need a shared mapping at all. So you'd use the proper interfaces for that, and for that, your "thread_semaphore()" function would just do a malloc() and initialize the memory to zero. Doing a mmap or something like that would just be stupid, because you're protecting only one VM space anyway. In other cases, you may need to have process-wide semaphores, and you'd use "process_semaphore(char *ID)" or something, which actually does a mmap() on a shared file. Or you'd have "fork_semaphore()" that creates a semaphore that is valid across forks, not not valid across execve's and cannot be passed around. So normally the user does NOT create the shared mapping himself. Normally you'd just use the "proper interface" for your needs, nothing more. Sure, you can have the option of saying "I've created this shared memory region, please make it use the generic semaphore engine code", but quite frankly I think that is a BAD IDEA. Why? Because it won't work portably across architectures anyway. You don't know what the requirements of the architecture are, so it should be done by a nice "semaphore library". NOT by the user. Remember: these semaphores are NOT a new SysV bogosity. These semaphores are a new interface, with sane performance and sane design. And you can have multiple external interfaces to the same "semaphore engine". I'm not interested in re-creating the idiocies of Sys IPC. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 21:41 ` Linus Torvalds @ 2001-04-19 22:46 ` Ulrich Drepper 2001-04-20 1:35 ` Alexander Viro 0 siblings, 1 reply; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 22:46 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz Linus Torvalds <torvalds@transmeta.com> writes: > I'm not interested in re-creating the idiocies of Sys IPC. I'm not talking about sysv semaphores (couldn't care less). And you haven't read any of the mails with examples I sent. If the new interface can be useful for anything it must allow to implement process-shared POSIX mutexes. The user-level representation of these mutexes are simple variables which in the case of inter-process mutexes are placed in shared memory. These variables must be usable with the normal pthread_mutex_lock() functions and perform whatever is needed. Whether the pthread_mutex_init() function for shared mutexes is doing a lot more work and allocates even more memory, I don't care. The standard certainly permits this and every pthread_mutex_init() must have a pthread_mutex_destroy() which allows allocating and freeing resources (no file descriptor, though). So, yes, your FS_create syscall can allocate something. But the question is what handle to put in the pthread_mutex_t variable so the different processes can use the mutex. It cannot be a file descriptor since it's not shared between processes. It cannot be a pointer to some other place in the virtual memory since the place pointed to might not be (and probably isn't if FS_create is allocating something in the process setting up the mutex). You could put some magic cookie in the pthread_mutex_t object the kernel can then use. So, instead of repeating over and over again the same old story, fill in the gaps here: int pthread_mutex_init (pthread_mutex_t *mutex, const pthread_mutexattr_t *mutex_attr) { if (mutex_attr != NULL && mutex_attr->__pshared != 0) { ... FILL IN HERE ... } else ...intra-process mutex, uninteresting here... } int pthread_mutex_lock (pthread_mutex_t *mutex) { if (mutex_attr != NULL && mutex_attr->__pshared != 0) { ... FILL IN HERE ... } else ...intra-process mutex, uninteresting here... } int pthread_mutex_destroy (pthread_mutex_t *mutex) { if (mutex_attr != NULL && mutex_attr->__pshared != 0) { ... FILL IN HERE ... } else ...intra-process mutex, uninteresting here... } These functions must work with something like this: ~~~~~~~~~~~~~~~~~~~~~ cons.c ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> int main (int argc, char *argv[]) { char tmpl[] = "/tmp/fooXXXXXX"; int fd = mkstemp (tmpl); pthread_mutexattr_t attr; pthread_mutex_t *m1; pthread_mutex_t *m2; void *addr; volatile int *i; pthread_mutexattr_init (&attr); pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED); ftruncate (fd, 2 * sizeof (*m1) + sizeof (int)); addr = mmap (NULL, sizeof (*m1), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); m1 = addr; m2 = m1 + 1; i = (int *) (m2 + 1); *i = 0; pthread_mutex_init (m1, &attr); pthread_mutex_lock (m1); pthread_mutex_init (m2, &attr); pthread_mutex_lock (m2); if (fork () == 0) { char buf[10]; snprintf (buf, sizeof buf, "%d", fd); execl ("./prod", "prod", buf, NULL); } while (1) { pthread_mutex_lock (m1); printf ("*i = %d\n", *i); pthread_mutex_unlock (m2); } return 0; } ~~~~~~~~~~~~~~~~~~~~~~prod.c ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> int main (int argc, char *argv[]) { int fd = atoi (argv[1]); void *addr; pthread_mutex_t *m1; pthread_mutex_t *m2; volatile int *i; addr = mmap (NULL, sizeof (*m1), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); m1 = addr; m2 = m1 + 1; i = (int *) (m2 + 1); while (1) { ++*i; pthread_mutex_unlock (m1); pthread_mutex_lock (m2); } return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 22:46 ` Ulrich Drepper @ 2001-04-20 1:35 ` Alexander Viro 2001-04-20 2:45 ` Ulrich Drepper 0 siblings, 1 reply; 60+ messages in thread From: Alexander Viro @ 2001-04-20 1:35 UTC (permalink / raw) To: Ulrich Drepper Cc: Linus Torvalds, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz On 19 Apr 2001, Ulrich Drepper wrote: > Linus Torvalds <torvalds@transmeta.com> writes: > > > I'm not interested in re-creating the idiocies of Sys IPC. > > I'm not talking about sysv semaphores (couldn't care less). And you > haven't read any of the mails with examples I sent. > > If the new interface can be useful for anything it must allow to > implement process-shared POSIX mutexes. Pardon me the bluntness, but... Why? * on _any_ UNIX we can implement semaphore (object that has Dijkstra's P and V operations, whatever) shared by processes that have access to pipe. In a portable way. That's the part of pipe semantics that had been there since way before v6. Pre-sysv, pre-POSIX, etc. When named pipes appeared the same semantics had been carried to them. Agreed so far? * if we have shared memory _and_ some implementation of semaphores we can (on architectures that allow atomic_dec() and atomic_inc()) produce semaphores that work via memory access in uncontended case and use slow semaphores to handle contention side of the business. Nothing UNIX-specific here. * such objects _are_ useful. They are reasonably portable and if they fit the task at hand and are cheaper than POSIX mutexes - that's all rationale one could need for using them. Sure, the variant I've posted was intra-process only, simply because it uses normal pipes. Implementation with named pipes is also trivial - when you map the shared area, allocate private one of the corresponding size and keep descriptors there. End of story. AFAICS mechanism is portable enough (and even on the architectures that do not allow atomic userland operations we can survive - just fall back to "slow" ones via read()/write() on pipes). And excuse me, but when one writes an application code the question is not "how to make it use POSIX semaphores", it's "how to get the serialization I need in a portable way". ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-20 1:35 ` Alexander Viro @ 2001-04-20 2:45 ` Ulrich Drepper 0 siblings, 0 replies; 60+ messages in thread From: Ulrich Drepper @ 2001-04-20 2:45 UTC (permalink / raw) To: Alexander Viro Cc: Linus Torvalds, Abramo Bagnara, Alon Ziv, Kernel Mailing List, Mike Kravetz Alexander Viro <viro@math.psu.edu> writes: > > If the new interface can be useful for anything it must allow to > > implement process-shared POSIX mutexes. > > Pardon me the bluntness, but... Why? Because otherwise there is no reason to even waste a second with this. At least for me and everybody else who has interest in portable solutions. I don't care how it's implemented. Look at the code example I posted. If you can provide an implementation which can implement anonymous inter-process mutexes then ring again. Until then I'll wait. If you implement something else I couldn't care less since it's useless for me. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:11 ` Linus Torvalds 2001-04-19 16:33 ` Alexander Viro @ 2001-04-19 16:43 ` Abramo Bagnara 2001-04-19 20:47 ` Ingo Oeser 2 siblings, 0 replies; 60+ messages in thread From: Abramo Bagnara @ 2001-04-19 16:43 UTC (permalink / raw) To: Linus Torvalds Cc: Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper Linus Torvalds wrote: > > On Thu, 19 Apr 2001, Abramo Bagnara wrote: > > > > > [ Using file descriptors ] > > > > This would also permit: > > - to have poll() > > - to use mmap() to obtain the userspace area > > > > It would become something very near to sacred Unix dogmas ;-) > > No, this is NOT what the UNIX dogmas are all about. > > When UNIX says "everything is a file", it really means that "everything is > a stream of bytes". Things like magic operations on file desciptors are > _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic > semantics of file descriptors is NOT Unix dogma at all, it is a horrible > corruption of the original UNIX cleanlyness. Nice outpouring indeed, it seems taken from L'Ouvre au Noir by Marguerite Yourcenar ;-))) You're perfectly right but the file descriptor solution appeared to me a nice way to work around the Unix limitation to have poll(2) working only on file descriptor. Said this, I've no doubt that a better poll-like syscall would solve all that in a more elegant way. You understand that sometime we've no other choice that to design workarounds to minimize needed changes (and then often to maximize acceptance probability). OTOH you may always decide to do things in the elegant way, you've such a responsibility for linux kernel. -- Abramo Bagnara mailto:abramo@alsa-project.org Opera Unica Phone: +39.546.656023 Via Emilia Interna, 140 48014 Castel Bolognese (RA) - Italy ALSA project http://www.alsa-project.org It sounds good! ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:11 ` Linus Torvalds 2001-04-19 16:33 ` Alexander Viro 2001-04-19 16:43 ` Abramo Bagnara @ 2001-04-19 20:47 ` Ingo Oeser 2001-04-19 20:54 ` Linus Torvalds 2 siblings, 1 reply; 60+ messages in thread From: Ingo Oeser @ 2001-04-19 20:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, Apr 19, 2001 at 09:11:56AM -0700, Linus Torvalds wrote: > No, this is NOT what the UNIX dogmas are all about. > > When UNIX says "everything is a file", it really means that "everything is > a stream of bytes". Things like magic operations on file desciptors are > _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic > semantics of file descriptors is NOT Unix dogma at all, it is a horrible > corruption of the original UNIX cleanlyness. Right. And on semaphores, this stream is exactly 0 bytes long. This is perfectly normal and can be handled by all applications I'm aware of. My idea violates nothing here. > Please don't excuse "semaphore file descriptors" with the "everything is a > file" mantra. It is not at ALL applicable. > > The "everything is a file" mantra is to make pipe etc meaningful - > processes don't have to worry about whether the fd they have is from a > file open, a pipe() system call, opening a special block device, or a > socket()+connect() thing. They can just read and write. THAT is what UNIX > is all about. Right. And with my approach read() and write() with a buffer pointer != NULL would either yield an return value of "0" or -1 and set errno=EINVAL ("object not suitable for reading/writing"). Anyway they should return IMMIDIATELY in these cases. We already have these special semantics with devices. Look at /dev/sgX for an example how we pass even structured data via normal read/write (instead of "stream of bytes"). > And this is obviously NOT true of a "magic file descriptors for > semaphores". You can't pass it off as stdin to another process and expect > anything useful from it unless the other process _knows_ it is a special > semaphore thing and does mmap magic or something. see above. NOTHING special about this idea. No magic handling involved, unless the user of the fd knows what it is. For other users it will be just a normal fd with normal operations, since the special case is hidden well enough. This is even WAY simpler as all that tty-crap and similar devices, which read/write very dependend on their actual ioctl configuration. But since stupid POSIX forbids using fds for semaphores (according to Ulrich Drepper), this nice, simple and non-intrusive solution is out. Instead we should go with several new syscalls, user space dependencies, strange error handling and yet-to-discuss semantics. Everybody else byt you would have been kicked out by the core people for suggesting this ;-) Regards Ingo Oeser -- 10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag> <<<<<<<<<<<< been there and had much fun >>>>>>>>>>>> ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 20:47 ` Ingo Oeser @ 2001-04-19 20:54 ` Linus Torvalds 0 siblings, 0 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 20:54 UTC (permalink / raw) To: Ingo Oeser; +Cc: linux-kernel On Thu, 19 Apr 2001, Ingo Oeser wrote: > On Thu, Apr 19, 2001 at 09:11:56AM -0700, Linus Torvalds wrote: > > No, this is NOT what the UNIX dogmas are all about. > > > > When UNIX says "everything is a file", it really means that "everything is > > a stream of bytes". Things like magic operations on file desciptors are > > _anathema_ to UNIX. ioctl() is the worst wart of UNIX. Having magic > > semantics of file descriptors is NOT Unix dogma at all, it is a horrible > > corruption of the original UNIX cleanlyness. > > Right. And on semaphores, this stream is exactly 0 bytes long. > This is perfectly normal and can be handled by all applications > I'm aware of. It's perfectly normal, but it does NOT conform to the idea "everything is a file". The fact that there are other ugly examples (ioctls and special files) does not mean that adding a new one is a good idea. When people say "everything is a file", they mean that it can be _used_ as a file, not that it can passably return a valid error code. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 8:20 ` Alon Ziv 2001-04-19 8:52 ` Abramo Bagnara @ 2001-04-19 9:08 ` Ingo Oeser 2001-04-19 11:51 ` Alan Cox 2001-04-19 16:03 ` Linus Torvalds 3 siblings, 0 replies; 60+ messages in thread From: Ingo Oeser @ 2001-04-19 9:08 UTC (permalink / raw) To: Alon Ziv Cc: Kernel Mailing List, Mike Kravetz, Ulrich Drepper, Linus Torvalds On Thu, Apr 19, 2001 at 10:20:48AM +0200, Alon Ziv wrote: > My lightweight-semaphores were actually even simpler in userspace: > * the userspace struct was just a signed count and a file handle. > * Uncontended case is exactly like Linus' version (i.e., down() is decl + > js, up() is incl()). > * The contention syscall was (in my implementation) an ioctl on the FH; the > FH was a special one, from a private syscall (although with the new VFS I'd > have written it as just another specialized FS, or even referred into the > SysVsem FS). This is roughly the way I would prefer it. But I would dedicate a whole page to this struct, since this is the granularity we can decide sharing on. This also has the advantage, that we can include a lot of debugging info into this page, too. Some people would like to know current contenders, up/down ratio per second and contender etc. Why? We have the infrastructure and all the semantics already in place and it is well known to the programmers. We know how we inherit this stuff, what will happen on process termination and so on. I thought about this myself a lot, but just didn't like the idea to trust user space for up/down. I thought about abusing read() and write() for down() and up(). Just doing it partially in user space would be an significant speedup, once you got it right. Maybe we can even combine both of it like this: Then user space can do: /* open or create sema4 with normal open semantics */ fd=open("/dev/sema4/myone"); sema4=mmap(NULL,getpagesize(),,,fd,0); /* up */ atomic_inc_and_test_for_zero(sema4) && ioctl(fd,WAKE_SLEEPERS,NULL); /* down */ atomic_dec_and_test_negative(sema4) && ioctl(fd,SLEEP_NOW,NULL); or /* open or create sema4 with normal open semantics */ fd=open("/dev/sema4/myone"); /* up */ write(fd, NULL,0); /* do the atomic stuff and wakeup in kernel */ /* this might be stupid, but COULD be implemented */ /* add 4 items to counter */ write(fd, NULL, 4); /* down */ read(sama4, NULL, 0); We could even do trylock() by default, if we open O_NONBLOCK. Or we could do trylock sometimes using select() and poll(). This also makes it easy to add it to existing select() loops like Motif. This differences could even be hidden by the libc. IIRC there are some archs, which cannot do atomic operations without privileged instructions, which is not acceptable in user space. Also there are archs, which are not cache coherent (think NUMA) and where flushing these caches to the other CPUs is privileged. Last but not least there are clusters with process migration. My twofold approach would solve all these problems rather simply. It would be a libc decision on what to use now. And the libc knows enough about the application to handle all these cases. The only thing we still need, is what we do if a contender or waiter ist killed. Should we send SIGPIPE? Should we simply wake all the waiters? And we are not creating a new namespace again, but just use the standard UN*X one: File name space. Hopes this "fit into namespace" solution will be considered, because I don't like to have a new linux-only API with completely new semantics and things to care in wrappers, even if you don't use this stuff. I also don't like the "kill me if I do a mistake" that Linus proposed in the "bad_sem" label. Comments? Flames? Overengineered? Regards Ingo Oeser -- 10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag> <<<<<<<<<<<< been there and had much fun >>>>>>>>>>>> ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 8:20 ` Alon Ziv 2001-04-19 8:52 ` Abramo Bagnara 2001-04-19 9:08 ` Ingo Oeser @ 2001-04-19 11:51 ` Alan Cox 2001-04-19 16:03 ` Linus Torvalds 3 siblings, 0 replies; 60+ messages in thread From: Alan Cox @ 2001-04-19 11:51 UTC (permalink / raw) To: Alon Ziv Cc: Kernel Mailing List, Mike Kravetz, Ulrich Drepper, Linus Torvalds > My lightweight-semaphores were actually even simpler in userspace: > * the userspace struct was just a signed count and a file handle. > * Uncontended case is exactly like Linus' version (i.e., down() is decl + > js, up() is incl()). > * The contention syscall was (in my implementation) an ioctl on the FH; the > FH was a special one, from a private syscall (although with the new VFS I'd > have written it as just another specialized FS, or even referred into the > SysVsem FS). Which raises an even more interesting question. Suppose your semaphore function wanst a magic file system but was flock on a standard file ? The contention overhead is rather less nice than Linus proposal but it ought 8) to work without any kernel patches ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 8:20 ` Alon Ziv ` (2 preceding siblings ...) 2001-04-19 11:51 ` Alan Cox @ 2001-04-19 16:03 ` Linus Torvalds 2001-04-19 16:38 ` Alan Cox 2001-04-19 18:48 ` Olaf Titz 3 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 16:03 UTC (permalink / raw) To: Alon Ziv; +Cc: Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Alon Ziv wrote: > > * the userspace struct was just a signed count and a file handle. The main reason I wanted to avoid a filehandle is just because it's another name space that people already use, and that people know what the semantics are for (ie "open()" is _defined_ to return the "lowest available file descriptor", and people depend on that). So if you use a file handle, you'd need to do magic - open it, and then use dup2() to move it up high, or something. Which has its own set of problems: just _how_ high woul dyou move it? Would it potentially disturb an application that opens thousands of files, and knows that they get consecutive file descriptors? Which is _legal_ and well-defined in UNIX. However, I'm not married to the secure hash version - you could certainly use another name-space, and something more akin to file descriptors. You should be aware of issues like the above, though. Maybe it would be ok to say "if you use fast semaphores, they use file descriptors and you should no longer depend on consecutive fd's". But note how that might make it really nasty for things like libraries: can libraries use fast semaphores behind the back of the user? They might well want to use the semaphores exactly for things like memory allocator locking etc. But libc certainly cant use fd's behind peoples backs. So personally, I actually think that you must _not_ use file descriptors. But that doesn't mean that you couldn't have a more "file-desciptor-like" approach. Side note: the design _should_ allow for "lazy initialization". In particular, it should be ok for FS_create() to not actually do a system call at all, but just initialize the count and set a "uninitialized" flag. And then the actual initialization would be done at "FS_down()" time, and only if contention happens. Why? Note that there are many cases where contention simply _cannot_ happen. The classic one is a thread-safe library that is used both by threaded applications and by single-threaded ones, where the single-threaded one would never actually trigger contention. For these kinds of reasons it would actually be best to make try to abstract the interfaces (notably the system call interface) as much as possible, so that you can change the implementation inside the kernel without having to recompile applications that use it. So the sanest implementation might be one where - FS_create is a system call that just gets a 128-byte area and an ID. - the contention cases are plain system calls with no user-mode part to them at all. This allows people to modify the behaviour of the semaphores later, _without_ having any real coupling between user-mode expectations and kernel implementation. For example, if the user-mode library actually does a physical "open()" or plays games with file descriptors itself, we will -always- be stuck with the fd approach, and we can never fix it. But if you have opaque system calls, you mist start out with a system call that internally just does the equivalent of the "open a file descriptor and hide it in the semaphore", and later on the thing can be changed to do whatever else without the user program ever even realizing.. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:03 ` Linus Torvalds @ 2001-04-19 16:38 ` Alan Cox 2001-04-19 16:46 ` Linus Torvalds 2001-04-19 19:47 ` Ulrich Drepper 2001-04-19 18:48 ` Olaf Titz 1 sibling, 2 replies; 60+ messages in thread From: Alan Cox @ 2001-04-19 16:38 UTC (permalink / raw) To: Linus Torvalds Cc: Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper > can libraries use fast semaphores behind the back of the user? They might > well want to use the semaphores exactly for things like memory allocator > locking etc. But libc certainly cant use fd's behind peoples backs. libc is entitled to, and most definitely does exactly that. Take a look at things like gethostent, getpwent etc etc. Alan ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:38 ` Alan Cox @ 2001-04-19 16:46 ` Linus Torvalds 2001-04-19 17:12 ` Alan Cox ` (3 more replies) 2001-04-19 19:47 ` Ulrich Drepper 1 sibling, 4 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-19 16:46 UTC (permalink / raw) To: Alan Cox; +Cc: Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, 19 Apr 2001, Alan Cox wrote: > > can libraries use fast semaphores behind the back of the user? They might > > well want to use the semaphores exactly for things like memory allocator > > locking etc. But libc certainly cant use fd's behind peoples backs. > > libc is entitled to, and most definitely does exactly that. Take a look at > things like gethostent, getpwent etc etc. Ehh.. I will bet you $10 USD that if libc allocates the next file descriptor on the first "malloc()" in user space (in order to use the semaphores for mm protection), programs _will_ break. You want to take the bet? Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:46 ` Linus Torvalds @ 2001-04-19 17:12 ` Alan Cox 2001-04-19 22:35 ` Rogier Wolff 2001-04-20 9:29 ` Olaf Titz ` (2 subsequent siblings) 3 siblings, 1 reply; 60+ messages in thread From: Alan Cox @ 2001-04-19 17:12 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper > > libc is entitled to, and most definitely does exactly that. Take a look at > > things like gethostent, getpwent etc etc. > > Ehh.. I will bet you $10 USD that if libc allocates the next file > descriptor on the first "malloc()" in user space (in order to use the > semaphores for mm protection), programs _will_ break. > > You want to take the bet? Its not normally a good idea to take a Linus bet, but this time Im obviously missing something. fd0-2 will be passed in (and if not then shit already happens - see old bugtraq on the matter for setuid apps, glibc bugs) So the C library gets fd 3 My first fopen gets fd 4. That can already happen and isnt new. Several profiling libraries on Unix have precisely this effect already. They dynamic link/loader will also open file handles to do mmaps although generally you wont see those as they are closed again after mapping. Internationalisation code in glibc will also open and map tables during startup ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 17:12 ` Alan Cox @ 2001-04-19 22:35 ` Rogier Wolff 0 siblings, 0 replies; 60+ messages in thread From: Rogier Wolff @ 2001-04-19 22:35 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper Alan Cox wrote: > > > libc is entitled to, and most definitely does exactly that. Take a look at > > > things like gethostent, getpwent etc etc. > > > > Ehh.. I will bet you $10 USD that if libc allocates the next file > > descriptor on the first "malloc()" in user space (in order to use the > > semaphores for mm protection), programs _will_ break. > > > > You want to take the bet? > > Its not normally a good idea to take a Linus bet, but this time Im obviously > missing something. fd0-2 will be passed in (and if not then shit already > happens - see old bugtraq on the matter for setuid apps, glibc bugs) > > So the C library gets fd 3 > My first fopen gets fd 4. Code may close (0); close (1); close (2); ... malloc (); /* Now open our controlling TTY/ stdin .. */ fd = open (... ) ; After taking care of this (*), problem I find the fd trick WAY more appealing than Linus' magic numbers. With file descriptors we have a "small integer which can be validated quickly". We also have storage for a private pointer somewhere in the fd structure. If people are TOO afraid of breaking something, creating a new set of small integers handled similarly as "fds" would do fine. (Maybe here we'd allocate just a few, and reallocate when neccesary). Roger. (*) I bet that get_sem_fd () { int rv; int fd; fd = get_fd (); if (fd < 5) { rv = get_sem_fd (); close(fd); fd = rv; } return fd; } will not break much. (UGLY coding. Don't tell me.) -- ** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 ** *-- BitWizard writes Linux device drivers for any device you may have! --* * There are old pilots, and there are bold pilots. * There are also old, bald pilots. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:46 ` Linus Torvalds 2001-04-19 17:12 ` Alan Cox @ 2001-04-20 9:29 ` Olaf Titz 2001-04-20 14:19 ` Jesse Pollard 2001-04-20 23:33 ` Linus Torvalds 2001-04-23 19:18 ` Ingo Oeser 2001-04-24 0:19 ` David Wagner 3 siblings, 2 replies; 60+ messages in thread From: Olaf Titz @ 2001-04-20 9:29 UTC (permalink / raw) To: linux-kernel > Ehh.. I will bet you $10 USD that if libc allocates the next file > descriptor on the first "malloc()" in user space (in order to use the > semaphores for mm protection), programs _will_ break. Of course, but this is a result from sloppy coding. In general, open() can just return anything and about the only case where you can even think of ignoring its result is this: close(0); close(1); close(2); open("/dev/null", O_RDWR); dup(0); dup(0); (which is even not clean for other reasons). I can't imagine depending on the "fact" that the first fd I open is 3, the next is 4, etc. And what if the routine in question is not malloc() but e.g. getpwuid()? Both are just arbitrary library functions, and one of them clearly does open file descriptors, depending on their implementation. What would the reason[1] be for wanting contiguous fd space anyway? Olaf [1] apart from not having understood how poll() works of course. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-20 9:29 ` Olaf Titz @ 2001-04-20 14:19 ` Jesse Pollard 2001-04-20 18:36 ` Olaf Titz 2001-04-20 23:33 ` Linus Torvalds 1 sibling, 1 reply; 60+ messages in thread From: Jesse Pollard @ 2001-04-20 14:19 UTC (permalink / raw) To: olaf, linux-kernel Olaf Titz <olaf@bigred.inka.de>: > > Ehh.. I will bet you $10 USD that if libc allocates the next file > > descriptor on the first "malloc()" in user space (in order to use the > > semaphores for mm protection), programs _will_ break. > > Of course, but this is a result from sloppy coding. In general, open() > can just return anything and about the only case where you can even > think of ignoring its result is this: > close(0); close(1); close(2); > open("/dev/null", O_RDWR); dup(0); dup(0); > (which is even not clean for other reasons). > > I can't imagine depending on the "fact" that the first fd I open is 3, > the next is 4, etc. And what if the routine in question is not > malloc() but e.g. getpwuid()? Both are just arbitrary library > functions, and one of them clearly does open file descriptors, > depending on their implementation. > > What would the reason[1] be for wanting contiguous fd space anyway? > > Olaf > > [1] apart from not having understood how poll() works of course. Optimization use in select: If all "interesting" file id's are known to be below "n", then only the first "n" bits in a FD_ISSET need to be examined. As soon as the bits are scattered, it takes MUCH longer to check for activity.... It may not be the "best" way, but what I tend to do is: Umm - this is snipped from a multiplexed logger using FIFOs for and indeterminate amount of data from differet utilities sending text buffers (normally one line at a time but could be more). static void fd_init(argc,argv) int argc; /* number of parameters */ char **argv; /* parameter list */ { int i,j; /* scratch counters */ static char str[50]; pnames = argv; FD_ZERO(&in_files); /* init all file descriptor sets */ for (i = 0; i <= MAX_LOG && i < argc; i++) { sprintf(str,"/tmp/%s",pnames[i]); mkfifo(str,0600); /* assume it exists */ inlogfd[i] = open(str,O_RDONLY | O_NDELAY); FD_SET(inlogfd[i],&in_files); } used = i; } Then I can scan for any activity by: do { while (select(MAX_LOG,&active,NULL,NULL,NULL) >= 0) { for(i = 0; i <= used; i++) { if (FD_ISSET(inlogfd[i],&active)) { r=ioctl(inlogfd[i],FIONREAD,&n); while (n > 0) { r = (n > BUF_MAX - 1) ? BUF_MAX - 1: n; read(inlogfd[i],buf,r); printbuf(pnames[i],r); n -= r; } } } active = in_files; } } while (errno == EINTR); ------------------------------------------------------------------------- Jesse I Pollard, II Email: pollard@navo.hpc.mil Any opinions expressed are solely my own. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-20 14:19 ` Jesse Pollard @ 2001-04-20 18:36 ` Olaf Titz 0 siblings, 0 replies; 60+ messages in thread From: Olaf Titz @ 2001-04-20 18:36 UTC (permalink / raw) To: Jesse Pollard; +Cc: linux-kernel > Optimization use in select: If all "interesting" file id's are known > to be below "n", then only the first "n" bits in a FD_ISSET need to > be examined. As soon as the bits are scattered, it takes MUCH longer > to check for activity.... That's an optimization, not a correctness issue. > for (i = 0; i <= MAX_LOG && i < argc; i++) { > sprintf(str,"/tmp/%s",pnames[i]); > mkfifo(str,0600); /* assume it exists */ > inlogfd[i] = open(str,O_RDONLY | O_NDELAY); > FD_SET(inlogfd[i],&in_files); > } This works regardless of what the open() returns. What does not work is using MAX_LOG (assuming it is constant) later in the following form: > while (select(MAX_LOG,&active,NULL,NULL,NULL) >= 0) { I see no way around computing the maximum of the inlogfd[i] values +1. (Which can of course be done just after the opens above. Note that the last opened fd _is_ guaranteed to get the highest number; FD_SET is one of the library routines where you can be pretty confident they don't open fds...) Btw. there are two problems even assuming you do get contiguous fds: - an off by one error in the case of argc > MAX_LOG, the first argument of select() is maximum fd _plus one_ - from an optimization POV it is highly advisable to take only the real maximum anyway. Olaf ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-20 9:29 ` Olaf Titz 2001-04-20 14:19 ` Jesse Pollard @ 2001-04-20 23:33 ` Linus Torvalds 2001-04-21 4:06 ` fd allocation [was: light weight user level semaphores] Edgar Toernig ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Linus Torvalds @ 2001-04-20 23:33 UTC (permalink / raw) To: linux-kernel In article <E14qXEU-0005xo-00@g212.hadiko.de>, Olaf Titz <olaf@bigred.inka.de> wrote: >> Ehh.. I will bet you $10 USD that if libc allocates the next file >> descriptor on the first "malloc()" in user space (in order to use the >> semaphores for mm protection), programs _will_ break. > >Of course, but this is a result from sloppy coding. ABSOLUTELY NOT! This is guaranteed behaviour of UNIX. You get file handles in order, or you don't get them at all. Sure, some library functions are allowed to use up file handles. But most sure as hell are NOT. > In general, open() >can just return anything and about the only case where you can even >think of ignoring its result is this: > close(0); close(1); close(2); > open("/dev/null", O_RDWR); dup(0); dup(0); Which is quite common to do. Imagine a server that starts up another process, which does exactly something like the above: the _usual_ execve() case looks something like pid = fork(); if (!pid) { close(0); close(1); dup(pipe[0]); /* input pipe */ dup(pipe[1]); /* output pipe */ execve("child"); exit(1); } The above is absolutely _standard_ behaviour. It's required to work. And btw, it's _still_ required to work even if there happens to be a "malloc()" in between the close() and the dup() calls. Trust me. You're arguing for clearly broken behaviour. malloc() and friends MUST NOT open file descriptors. It _will_ break programs that rely on traditional and documented features. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: fd allocation [was: light weight user level semaphores] 2001-04-20 23:33 ` Linus Torvalds @ 2001-04-21 4:06 ` Edgar Toernig 2001-04-22 9:48 ` Olaf Titz 2001-04-21 10:13 ` Olaf Titz 2001-04-23 15:34 ` Jeff Garzik 2 siblings, 1 reply; 60+ messages in thread From: Edgar Toernig @ 2001-04-21 4:06 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus Torvalds wrote: > > pid = fork(); > if (!pid) { > close(0); > close(1); > dup(pipe[0]); /* input pipe */ > dup(pipe[1]); /* output pipe */ > execve("child"); > exit(1); > } > > The above is absolutely _standard_ behaviour. It's required to work. > > And btw, it's _still_ required to work even if there happens to be a > "malloc()" in between the close() and the dup() calls. Right. This is expected (and defined) behaviour. But do you have _any_ example where this is used for fds > 2? I can't remember. And IMHO that would be pretty fragile too. Shell scripts sometimes open temporary fds > 2 and these are passed to called programs. I.e. #!/bin/sh exec 3>log echo >&3 "script started" ls /proc/self/fd # gets fd3 already opened ls /proc/self/fd 4</dev/null # now 3 and 4 already in use... # or look into any configure script... So, IMHO as long as some library does not mess with fds 0, 1, and 2 it should be ok [1]. Yes, it would be against the standard but I still have to find some code where this semantic is used for fds > 2. Ciao, ET. PS: I would prefer to keep the standard semantics but the reasons for that are pretty weak ... ;-) PPS: Even your sample code is fragile. It breaks if I start it with ./a.out <&- ;-) (the close(0) is likely to close one end of the pipe) [1] Unintentionally setting the controlling tty may be a problem. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: fd allocation [was: light weight user level semaphores] 2001-04-21 4:06 ` fd allocation [was: light weight user level semaphores] Edgar Toernig @ 2001-04-22 9:48 ` Olaf Titz 2001-04-22 11:41 ` light weight user level semaphores Alon Ziv ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Olaf Titz @ 2001-04-22 9:48 UTC (permalink / raw) To: linux-kernel > So, IMHO as long as some library does not mess with fds 0, 1, and 2 > it should be ok [1]. Yes, it would be against the standard but I >... > [1] Unintentionally setting the controlling tty may be a problem. The controlling tty is not what is first opened to fd 0 but what is first opened, so this problem can occur at any time. Olaf ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 9:48 ` Olaf Titz @ 2001-04-22 11:41 ` Alon Ziv 2001-04-22 12:44 ` Alan Cox 2001-04-22 11:41 ` Alon Ziv 2001-04-22 14:18 ` David Woodhouse 2 siblings, 1 reply; 60+ messages in thread From: Alon Ziv @ 2001-04-22 11:41 UTC (permalink / raw) To: linux-kernel All of this FD allocation stuff is truly distrurbing. This appears to be the one place where Win32 got it (almost) right--- quite about every kernel object looks to userland just like an opaque handle, and the same operations apply to all of them. So (e.g.) a mixed wait for socket operation or a semaphore or a timer is very simple. The only abstraction we have that is even remotely similar is the FD, yet its semantics are far too strict to use this way. The only remotely-feasible idea I've had, so far, was to allow "negative" FDs (i.e., numbered 0x80000000+) to be used for semaphores; this sidesteps the POSIX requirements (= we can just claim we don't support more than 2G FDs per process), but still leaves us with the problems of managing a split (or extremely large) FD table _and_ with the issue of allocation policy... Besides, as Linus already said, FDs are likely not the right abstraction for objects without file behavior, like semaphores or timers. [BTW, another solution is to truly support opaque "handles" to kernel objects; I believe David Howells is already working on something like this for Wine? The poll interface can be trivially extended to support waiting on those...] -az ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 11:41 ` light weight user level semaphores Alon Ziv @ 2001-04-22 12:44 ` Alan Cox 2001-04-22 15:19 ` Alon Ziv 0 siblings, 1 reply; 60+ messages in thread From: Alan Cox @ 2001-04-22 12:44 UTC (permalink / raw) To: Alon Ziv; +Cc: linux-kernel > All of this FD allocation stuff is truly distrurbing. > This appears to be the one place where Win32 got it (almost) right--- > quite about every kernel object looks to userland just like an opaque > handle, and the same operations apply to all of them. Unix got this right, then AT&T broke it in System III. One very good reason for pipe based semaphore stuff is precisely that it works in poll/select/SIGIO Alan ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 12:44 ` Alan Cox @ 2001-04-22 15:19 ` Alon Ziv 2001-04-22 14:31 ` Alexander Viro 0 siblings, 1 reply; 60+ messages in thread From: Alon Ziv @ 2001-04-22 15:19 UTC (permalink / raw) To: linux-kernel; +Cc: Alan Cox Well, that's the reason for my small-negative-integer semaphore-FD idea... (It won't support select() easily, but poll() is prob'ly good enough) Still, there is the problem of read()/write()/etc. semantics; sure, we can declare that 'negative FDs' have their own semantics which just happen to include poll(), but it sure looks like a kludge... -az ----- Original Message ----- From: "Alan Cox" <alan@lxorguk.ukuu.org.uk> To: "Alon Ziv" <alonz@nolaviz.org> Cc: <linux-kernel@vger.kernel.org> Sent: Sunday, April 22, 2001 14:44 Subject: Re: light weight user level semaphores > > All of this FD allocation stuff is truly distrurbing. > > This appears to be the one place where Win32 got it (almost) right--- > > quite about every kernel object looks to userland just like an opaque > > handle, and the same operations apply to all of them. > > Unix got this right, then AT&T broke it in System III. One very good reason > for pipe based semaphore stuff is precisely that it works in poll/select/SIGIO > > Alan > > > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 15:19 ` Alon Ziv @ 2001-04-22 14:31 ` Alexander Viro 2001-04-22 16:08 ` Alon Ziv 0 siblings, 1 reply; 60+ messages in thread From: Alexander Viro @ 2001-04-22 14:31 UTC (permalink / raw) To: Alon Ziv; +Cc: linux-kernel, Alan Cox On Sun, 22 Apr 2001, Alon Ziv wrote: > Well, that's the reason for my small-negative-integer semaphore-FD idea... > (It won't support select() easily, but poll() is prob'ly good enough) > Still, there is the problem of read()/write()/etc. semantics; sure, we can > declare that 'negative FDs' have their own semantics which just happen to > include poll(), but it sure looks like a kludge... You _still_ don't get it. The question is not "how to add magic kernel objects that would look like descriptors and support a binch of ioctls, allowing to do semaphores", it's "do we need semaphores to be kernel-level objects". Implementation with pipes allows to avoid the magic crap - they are real, normal pipes - nothing special from the kernel POV. read(), write(), etc. are nothing but reading and writing for pipes. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 14:31 ` Alexander Viro @ 2001-04-22 16:08 ` Alon Ziv 0 siblings, 0 replies; 60+ messages in thread From: Alon Ziv @ 2001-04-22 16:08 UTC (permalink / raw) To: linux-kernel; +Cc: Alan Cox, Alexander Viro Oh, I don't argue about that. (Well, almost--- see below...) It's just that we need _some_ method for getting over the silly POSIX FD-handling restrictions... And the negative-FDs may be the solution. (Note I said we 'can' declare other semantics; not 'should'. So these FDs can still be normal ones, just at the other end of the numbering range...) My misgivings are: * There's no way to integrate other signalling mechanisms; e.g., we may wish for a 'wake-all-waiters' signaller, or for a 'timed-wait' that arrives via an FD and not as a signal * a pipe is a more-or-less good semaphore; it may be too heavyweight, as it's forced to pass useless [in this case] info, and we can't control its wakeup order [although POSIX doesn't seem to require this] [ Actually, I once had an idea of binding signals into an FD, so they can be 'read' out of it... with that, an alarm() is a 'timed-wait' waitable by poll() :-) ] -az ----- Original Message ----- From: "Alexander Viro" <viro@math.psu.edu> To: "Alon Ziv" <alonz@nolaviz.org> Cc: <linux-kernel@vger.kernel.org>; "Alan Cox" <alan@lxorguk.ukuu.org.uk> Sent: Sunday, April 22, 2001 16:31 Subject: Re: light weight user level semaphores > > > On Sun, 22 Apr 2001, Alon Ziv wrote: > > > Well, that's the reason for my small-negative-integer semaphore-FD idea... > > (It won't support select() easily, but poll() is prob'ly good enough) > > Still, there is the problem of read()/write()/etc. semantics; sure, we can > > declare that 'negative FDs' have their own semantics which just happen to > > include poll(), but it sure looks like a kludge... > > You _still_ don't get it. The question is not "how to add magic kernel > objects that would look like descriptors and support a binch of > ioctls, allowing to do semaphores", it's "do we need semaphores > to be kernel-level objects". Implementation with pipes allows to avoid > the magic crap - they are real, normal pipes - nothing special from > the kernel POV. read(), write(), etc. are nothing but reading and writing > for pipes. > > > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 9:48 ` Olaf Titz 2001-04-22 11:41 ` light weight user level semaphores Alon Ziv @ 2001-04-22 11:41 ` Alon Ziv 2001-04-22 14:18 ` David Woodhouse 2 siblings, 0 replies; 60+ messages in thread From: Alon Ziv @ 2001-04-22 11:41 UTC (permalink / raw) To: linux-kernel All of this FD allocation stuff is truly distrurbing. This appears to be the one place where Win32 got it (almost) right--- quite about every kernel object looks to userland just like an opaque handle, and the same operations apply to all of them. So (e.g.) a mixed wait for socket operation or a semaphore or a timer is very simple. The only abstraction we have that is even remotely similar is the FD, yet its semantics are far too strict to use this way. The only remotely-feasible idea I've had, so far, was to allow "negative" FDs (i.e., numbered 0x80000000+) to be used for semaphores; this sidesteps the POSIX requirements (= we can just claim we don't support more than 2G FDs per process), but still leaves us with the problems of managing a split (or extremely large) FD table _and_ with the issue of allocation policy... Besides, as Linus already said, FDs are likely not the right abstraction for objects without file behavior, like semaphores or timers. [BTW, another solution is to truly support opaque "handles" to kernel objects; I believe David Howells is already working on something like this for Wine? The poll interface can be trivially extended to support waiting on those...] -az ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 9:48 ` Olaf Titz 2001-04-22 11:41 ` light weight user level semaphores Alon Ziv 2001-04-22 11:41 ` Alon Ziv @ 2001-04-22 14:18 ` David Woodhouse 2001-04-23 13:19 ` David Howells 2 siblings, 1 reply; 60+ messages in thread From: David Woodhouse @ 2001-04-22 14:18 UTC (permalink / raw) To: Alon Ziv; +Cc: linux-kernel, dhowells alonz@nolaviz.org said: > [BTW, another solution is to truly support opaque "handles" to kernel > objects; I believe David Howells is already working on something like > this for Wine? The poll interface can be trivially extended to support > waiting on those...] ISTR it wasn't quite trivial to do it that way - it would require the addition of an extra argument to the fops->poll() method. David? -- dwmw2 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-22 14:18 ` David Woodhouse @ 2001-04-23 13:19 ` David Howells 2001-04-23 14:48 ` Alon Ziv 0 siblings, 1 reply; 60+ messages in thread From: David Howells @ 2001-04-23 13:19 UTC (permalink / raw) To: David Woodhouse; +Cc: alonz, linux-kernel David Woodhouse <dwmw2@infradead.org> wrote: > alonz@nolaviz.org said: > > [BTW, another solution is to truly support opaque "handles" to kernel > > objects; I believe David Howells is already working on something like > > this for Wine? Yes. However, it uses a different system call set to use them. They translate to small object structures internally. > > The poll interface can be trivially extended to support > > waiting on those...] No, they aren't files. I did not want to use "files" because this would incur a fairly major penalty for each object: struct file + struct dentry + struct inode Which would mean that Win32 File objects would require two of each, one set to hold the extra Win32 attributes and one set for the actual Linux file. The way I've chosen uses somewhat less memory and should be faster. > ISTR it wasn't quite trivial to do it that way - it would require the > addition of an extra argument to the fops->poll() method. Yes, the PulseEvent operation demands that all processes currently waiting on the event should be woken, but that no processes attaching immediately afterward get triggered. This means that the PulseEvent handler has to be able to notify all the processes currently waiting on the queue and only those processes. I got it to do this by marking the waiter records each process links into the queue. Oh... and WaitForMultipleObjects also has a "wait for all" option. David ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-23 13:19 ` David Howells @ 2001-04-23 14:48 ` Alon Ziv 2001-04-23 15:40 ` David Howells 0 siblings, 1 reply; 60+ messages in thread From: Alon Ziv @ 2001-04-23 14:48 UTC (permalink / raw) To: linux-kernel; +Cc: David Howells From: "David Howells" <dhowells@warthog.cambridge.redhat.com> > David Woodhouse <dwmw2@infradead.org> wrote: > > alonz@nolaviz.org said: > > > [BTW, another solution is to truly support opaque "handles" to kernel > > > objects; I believe David Howells is already working on something like > > > this for Wine? > > Yes. However, it uses a different system call set to use them. They translate > to small object structures internally. > Obviously... since they're handles, not FDs... [BTW, are you using Windows' idea of storing the objects in process space, in a page that's inaccessible to the app itself, and passing pointers into this page as the handles?] > > > The poll interface can be trivially extended to support > > > waiting on those...] > > No, they aren't files. I did not want to use "files" because this would incur > a fairly major penalty for each object: > So what if they aren't files? If you look at (e.g.) AIX's poll(), it allows you to put SysV semaphore IDs in pollfd structures. (Actually they do even more--- they have an extended pollfd struct; but even without it, just putting a handle instead of FD and a special event code in a normal pollfd should suffice...) > struct file + struct dentry + struct inode > > Which would mean that Win32 File objects would require two of each, one set to > hold the extra Win32 attributes and one set for the actual Linux file. > I'm afraid I'm not following your logic in this; I believe most Win32 attrs can be mapped to more generic abstractions which should be able to exist at 'struct file' level. (And even if not, a Win32 file handle could just hold two pointers--- one to the 'struct file', and one to the extra attrs...) > The way I've chosen uses somewhat less memory and should be faster. > And breaks _completely_ with the existing scheme :-/ > > ISTR it wasn't quite trivial to do it that way - it would require the > > addition of an extra argument to the fops->poll() method. > > Yes, the PulseEvent operation demands that all processes currently waiting on > the event should be woken, but that no processes attaching immediately > afterward get triggered. > Huh? Where did you get this? Looking at my copy of MSDN (July '00), the PulseEvent remarks more-or-less suggest an implementation like SetEvent(e) ResetEvent(e) I don't see any mention of 'currently waiting' vs 'new' waiters. (Besides, I doubt MS tries to solve this in the SMP case...) > Oh... and WaitForMultipleObjects also has a "wait for all" option. Yes, this is a valid point... I wonder if it's possible to add _just_ this to poll()... -az ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-23 14:48 ` Alon Ziv @ 2001-04-23 15:40 ` David Howells 0 siblings, 0 replies; 60+ messages in thread From: David Howells @ 2001-04-23 15:40 UTC (permalink / raw) To: Alon Ziv; +Cc: linux-kernel Alon Ziv <alonz@nolaviz.org> wrote: > Obviously... since they're handles, not FDs... > [BTW, are you using Windows' idea of storing the objects in process space, > in a page that's inaccessible to the app itself, and passing pointers into > this page as the handles?] No... I grab a page in kernel space and use it as an array. One problem is that if an exit occurs, I have to be able to discard all attached objects after the process's VM has been cleaned up (ie: what if it gets swapped out?). Plus, mmap can clobber existing mappings, MapViewOfFile can't. > So what if they aren't files? Small structures private to my Win32 module. > I'm afraid I'm not following your logic in this; I believe most Win32 attrs > can be mapped to more generic abstractions which should be able to exist at > 'struct file' level. "Most"... It'd mean adding extra fields into struct file (and possibly struct inode) just for the use of this module (which probably wouldn't be accepted). > (And even if not, a Win32 file handle could just hold two pointers--- No. the extra data has to be accessible from CreateFile (potentially running in other processes), and this'd mean it'd have to go speculatively searching all Win32 handle tables currently in use. > And breaks _completely_ with the existing scheme :-/ So what? This is for a WINE accelerator/Win32 module only. There's already been an argument over making the whole lot available as general Linux functionality, but most people said that it'd be a bad idea because it'd not be portable. > Huh? Where did you get this? > Looking at my copy of MSDN (July '00), the PulseEvent remarks more-or-less > suggest an implementation like > SetEvent(e) > ResetEvent(e) Consider the following: WAITER 1 WAITER 2 WAITER 3 WAKER wait-on-event wait-on-event wait-on-event sleep sleep sleep PulseEvent set-event wake(WAITER 1) wake(WAITER 2) wake(WAITER 3) reset-event wake wake wake what-happened? what-happened? what-happened? nothing! nothing! nothing! sleep sleep sleep All three waiters should wake up with a note that the event triggered, but they don't. Plus a fourth waiter who begins to wait on the event after the set-event is issue probably shouldn't wake up. > I wonder if it's possible to add _just_ this to poll()... No... there's no way to pass this to poll (or select). Better to add a WaitForMultipleObjects() syscall and have that call do_select() with a flag. David ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-20 23:33 ` Linus Torvalds 2001-04-21 4:06 ` fd allocation [was: light weight user level semaphores] Edgar Toernig @ 2001-04-21 10:13 ` Olaf Titz 2001-04-23 15:34 ` Jeff Garzik 2 siblings, 0 replies; 60+ messages in thread From: Olaf Titz @ 2001-04-21 10:13 UTC (permalink / raw) To: linux-kernel [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 2617 bytes --] > This is guaranteed behaviour of UNIX. You get file handles in order, or > you don't get them at all. You get the _next free_ file handle in order. What if your program assumes they are all contiguous, and it is called by some other program which forgot about FD_CLOEXEC and has some higher fds still open? (xdm did this for ten years with its listening socket, just to name a well-known example. So every program which asssumes contiguous fd allocations would fail if started from an xdm session.) If your program makes assumptions on its environment which are not guaranteed it's broken. What _is_ guaranteed is that after consecutive allocations of fds like for (i=0; i<n; ++i) fd[i]=open(...); the following property holds: fd[i] > fd[j] if (i > j and fd[i]!=-1 and fd[j]!=-1). What is absolutely nowhere guaranteed is that fd[i+1] = fd[i]+1. It is not possible to guarantee this since any fd may be already open before main() starts. Of course you can guarantee that the fds are available like this: for (i=getdtablesize(); i>=0; --i) close(i); and not calling library functions which may open fds. > pid = fork(); > if (!pid) { > close(0); > close(1); > dup(pipe[0]); /* input pipe */ > dup(pipe[1]); /* output pipe */ > execve("child"); > exit(1); > } > > The above is absolutely _standard_ behaviour. It's required to work. The reason why it works is that (a) the target fds are 0 and 1, and (b) you close them explicitly. For less trivial uses, there is always dup2(). > And btw, it's _still_ required to work even if there happens to be a > "malloc()" in between the close() and the dup() calls. I wouldn't count on that. It's clearly not required to work if there's a getpwnam() in between. (I already had my share of problems with syslog() in exactly this situation.) Do we need a list of library functions which may open fds, like the infamous "list of functions which may move or purge memory" on the Mac (which grew longer with every OS release and Inside Mac supplement issue)? Do we need to know for each library routine how it is implemented? > Trust me. You're arguing for clearly broken behaviour. malloc() and > friends MUST NOT open file descriptors. It _will_ break programs that > rely on traditional and documented features. Traditional and documented is, in my view, the description as of the open(2) man page: When the call is successful, the file descriptor returned will be the lowest file descriptor not currently open for the pro cess. which of course is exactly how it is implemented in the kernel. Olaf ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-20 23:33 ` Linus Torvalds 2001-04-21 4:06 ` fd allocation [was: light weight user level semaphores] Edgar Toernig 2001-04-21 10:13 ` Olaf Titz @ 2001-04-23 15:34 ` Jeff Garzik 2 siblings, 0 replies; 60+ messages in thread From: Jeff Garzik @ 2001-04-23 15:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus Torvalds wrote: > Trust me. You're arguing for clearly broken behaviour. malloc() and > friends MUST NOT open file descriptors. It _will_ break programs that > rely on traditional and documented features. Indeed; STDIN_FILENO and friends are constants... -- Jeff Garzik | The difference between America and England is that Building 1024 | the English think 100 miles is a long distance and MandrakeSoft | the Americans think 100 years is a long time. | (random fortune) ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:46 ` Linus Torvalds 2001-04-19 17:12 ` Alan Cox 2001-04-20 9:29 ` Olaf Titz @ 2001-04-23 19:18 ` Ingo Oeser 2001-04-24 0:19 ` David Wagner 3 siblings, 0 replies; 60+ messages in thread From: Ingo Oeser @ 2001-04-23 19:18 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Alon Ziv, Kernel Mailing List, Mike Kravetz, Ulrich Drepper On Thu, Apr 19, 2001 at 09:46:17AM -0700, Linus Torvalds wrote: > > libc is entitled to, and most definitely does exactly that. Take a look at > > things like gethostent, getpwent etc etc. > > Ehh.. I will bet you $10 USD that if libc allocates the next file > descriptor on the first "malloc()" in user space (in order to use the > semaphores for mm protection), programs _will_ break. But we would not open the semaphore on malloc() but instead in the init functions of the libc. So the semaphore will be already allocated. May be dup2()ed to some very high range (INT_MAX-__GLIBC_MALLOC_SEM_FD) and the original fd closed. So this will be no real problem. That's why I don't like lazy init: May be you cannot init anymore, if you come to and condition, where you would need it. Also init/fini are usally very slow operations and as many things as possible are burdend onto their shoulders. Semaphores tend to be structures living very long (at least in all code I've written and seen so far) so I see no point in defering their initialization. Regards Ingo Oeser -- 10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag> <<<<<<<<<<<< been there and had much fun >>>>>>>>>>>> ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:46 ` Linus Torvalds ` (2 preceding siblings ...) 2001-04-23 19:18 ` Ingo Oeser @ 2001-04-24 0:19 ` David Wagner 2001-04-24 0:41 ` Alexander Viro 3 siblings, 1 reply; 60+ messages in thread From: David Wagner @ 2001-04-24 0:19 UTC (permalink / raw) To: linux-kernel Linus Torvalds wrote: >Ehh.. I will bet you $10 USD that if libc allocates the next file >descriptor on the first "malloc()" in user space (in order to use the >semaphores for mm protection), programs _will_ break. > >You want to take the bet? Good point. Speaking of which: ioctl(fd, UIOCATTACHSEMA, ...); seems to act like dup(fd) if fd was opened on "/dev/usemaclone" (see drivers/sgi/char/usema.c). According to usema(7), this is intended to help libraries implement semaphores. Is this a bad coding? Should the kernel really support an ioctl() that can silently allocate the next file descriptor? This seems like asking for trouble. Or, maybe I just misunderstood something. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-24 0:19 ` David Wagner @ 2001-04-24 0:41 ` Alexander Viro 0 siblings, 0 replies; 60+ messages in thread From: Alexander Viro @ 2001-04-24 0:41 UTC (permalink / raw) To: David Wagner; +Cc: linux-kernel On 24 Apr 2001, David Wagner wrote: > Linus Torvalds wrote: > >Ehh.. I will bet you $10 USD that if libc allocates the next file > >descriptor on the first "malloc()" in user space (in order to use the > >semaphores for mm protection), programs _will_ break. > > > >You want to take the bet? > > Good point. Speaking of which: > ioctl(fd, UIOCATTACHSEMA, ...); > seems to act like dup(fd) if fd was opened on "/dev/usemaclone" > (see drivers/sgi/char/usema.c). According to usema(7), this is > intended to help libraries implement semaphores. > > Is this a bad coding? Yes. Not to mention side effects, it's just plain ugly. Anyone who invents identifiers of _that_ level of ugliness should be forced to read them aloud for a week or so, until somebody will shoot him out of mercy. Out of curiosity: who was the author? It looks unusually nasty, even for SGI. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:38 ` Alan Cox 2001-04-19 16:46 ` Linus Torvalds @ 2001-04-19 19:47 ` Ulrich Drepper 1 sibling, 0 replies; 60+ messages in thread From: Ulrich Drepper @ 2001-04-19 19:47 UTC (permalink / raw) To: Alan Cox; +Cc: Linus Torvalds, Alon Ziv, Kernel Mailing List, Mike Kravetz Alan Cox <alan@lxorguk.ukuu.org.uk> writes: > > can libraries use fast semaphores behind the back of the user? They might > > well want to use the semaphores exactly for things like memory allocator > > locking etc. But libc certainly cant use fd's behind peoples backs. > > libc is entitled to, and most definitely does exactly that. Take a look at > things like gethostent, getpwent etc etc. You are mixing two completely different things. Functions like gethostent() and catopen() are explicitly allowed to be implemented using file descriptors. If this is allowed the standard contains appropriate wording. Other functions like setlocale() do use file descriptors, yes, but these are not kept. Before the function returns they are closed. This can cause disruptions in other threads which find descriptors not allocated sequentially but this has to be taken into account. Rules for multi-threaded applications are different. A single-threaded application will not see such a difference. Now, the standards do not allow POSIX mutexes to be implemented using file descriptors. The same is true for unnamed POSIX semaphores. So Linus is right, though for a different reason than he thought. The situation is a bit different for named POSIX semaphores. These can be implemented using file descriptors. But they don't have to and IMO they shouldn't. A memory reference based semaphore implementation would allow a named semaphore to be implemented using fd = open (name) addr = mmap (..fd..) close (fd) sem_syscall (addr) i.e., it can be mapped to a memory reference again. -- ---------------. ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: light weight user level semaphores 2001-04-19 16:03 ` Linus Torvalds 2001-04-19 16:38 ` Alan Cox @ 2001-04-19 18:48 ` Olaf Titz 1 sibling, 0 replies; 60+ messages in thread From: Olaf Titz @ 2001-04-19 18:48 UTC (permalink / raw) To: linux-kernel > problems: just _how_ high woul dyou move it? Would it potentially disturb > an application that opens thousands of files, and knows that they get > consecutive file descriptors? Which is _legal_ and well-defined in UNIX. Only if you close them before. The process may have been started with arbitrary fds open. > say "if you use fast semaphores, they use file descriptors and you should > no longer depend on consecutive fd's". Which you cannot anyway. Already some library routines can open fds although they don't explicitly say so and don't have to in all implementations, like openlog() or all the get*by*() stuff (or even dlopen()), so you are never sure to know which or how many FDs you actually have open. Olaf ^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2001-04-24 0:42 UTC | newest]
Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-04-19 13:59 light weight user level semaphores George Talbot
[not found] <20010417114433.D1108@w-mikek2.sequent.com>
2001-04-17 19:48 ` Linus Torvalds
2001-04-18 18:13 ` Bernd Eckenfels
2001-04-18 19:35 ` Ulrich Drepper
2001-04-19 8:20 ` Alon Ziv
2001-04-19 8:52 ` Abramo Bagnara
2001-04-19 9:08 ` Alexander Viro
2001-04-19 10:44 ` Abramo Bagnara
2001-04-19 16:11 ` Linus Torvalds
2001-04-19 16:33 ` Alexander Viro
2001-04-19 16:43 ` Linus Torvalds
2001-04-19 17:33 ` Alexander Viro
2001-04-19 17:38 ` Linus Torvalds
2001-04-19 18:24 ` Alexander Viro
2001-04-19 19:26 ` Ulrich Drepper
2001-04-19 19:35 ` Alan Cox
2001-04-19 20:06 ` Ulrich Drepper
2001-04-19 20:11 ` Alan Cox
2001-04-19 20:26 ` Ulrich Drepper
2001-04-19 20:22 ` Ingo Oeser
2001-04-19 20:40 ` Ulrich Drepper
2001-04-19 20:51 ` Linus Torvalds
2001-04-19 21:38 ` Alan Cox
2001-04-19 20:49 ` Linus Torvalds
2001-04-19 21:18 ` Ulrich Drepper
2001-04-19 21:41 ` Linus Torvalds
2001-04-19 22:46 ` Ulrich Drepper
2001-04-20 1:35 ` Alexander Viro
2001-04-20 2:45 ` Ulrich Drepper
2001-04-19 16:43 ` Abramo Bagnara
2001-04-19 20:47 ` Ingo Oeser
2001-04-19 20:54 ` Linus Torvalds
2001-04-19 9:08 ` Ingo Oeser
2001-04-19 11:51 ` Alan Cox
2001-04-19 16:03 ` Linus Torvalds
2001-04-19 16:38 ` Alan Cox
2001-04-19 16:46 ` Linus Torvalds
2001-04-19 17:12 ` Alan Cox
2001-04-19 22:35 ` Rogier Wolff
2001-04-20 9:29 ` Olaf Titz
2001-04-20 14:19 ` Jesse Pollard
2001-04-20 18:36 ` Olaf Titz
2001-04-20 23:33 ` Linus Torvalds
2001-04-21 4:06 ` fd allocation [was: light weight user level semaphores] Edgar Toernig
2001-04-22 9:48 ` Olaf Titz
2001-04-22 11:41 ` light weight user level semaphores Alon Ziv
2001-04-22 12:44 ` Alan Cox
2001-04-22 15:19 ` Alon Ziv
2001-04-22 14:31 ` Alexander Viro
2001-04-22 16:08 ` Alon Ziv
2001-04-22 11:41 ` Alon Ziv
2001-04-22 14:18 ` David Woodhouse
2001-04-23 13:19 ` David Howells
2001-04-23 14:48 ` Alon Ziv
2001-04-23 15:40 ` David Howells
2001-04-21 10:13 ` Olaf Titz
2001-04-23 15:34 ` Jeff Garzik
2001-04-23 19:18 ` Ingo Oeser
2001-04-24 0:19 ` David Wagner
2001-04-24 0:41 ` Alexander Viro
2001-04-19 19:47 ` Ulrich Drepper
2001-04-19 18:48 ` Olaf Titz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox