* [PATCH] semaphore fairness patch against test11-pre6
@ 2000-11-18 1:01 David Mansfield
2000-11-18 9:45 ` Christoph Rohland
0 siblings, 1 reply; 7+ messages in thread
From: David Mansfield @ 2000-11-18 1:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: lkml
[-- Attachment #1: Type: text/plain, Size: 1780 bytes --]
Hi Linus et al,
I've applied your semaphore fairness patch (slightly fixed) below. It
fixes my original bug report of vmstat, ps, etc. stalling while waiting
for the mmap_sem. I can now run my memory 'hog' processes and actually see
vmstat update every second even under heavy memory pressure. More
importantly, ps works so I can find the pid to kill. I'm no expert at
checking for races, but I went over all (I think) of the two-process
cases as well as I could and they seem OK to me, but what do I know. I
know someone else reported it didn't fix the problem, but perhaps that's
some other issue.
I ran many 'ps' processes (20?) in the background, trying to simulate
many-process contention, and everything still worked fine. I've run some
other
stress tests too (dbench, my own I/O throughput test etc) and so far
all's well (famous last words).
If you can find the time to check this out more completely, I recommend
it, because it seems like a great improvement to be able to accurately
see vmstat numbers in times of system load. I hope the other side
effects are beneficial as well :-)
The change to the patch was that you had 'if (sleepers > 1)' when
obviously you meant 'if (sem->sleepers > 1)'...
Here's your patch again (also attached in case of mangling):
--- linux/arch/i386/kernel/semaphore.c 2000/11/16 19:58:26 1.3
+++ linux/arch/i386/kernel/semaphore.c 2000/11/17 23:12:48
@@ -64,6 +64,14 @@
 	spin_lock_irq(&semaphore_lock);
 	sem->sleepers++;
+
+	/*
+	 * Are there other people waiting for this?
+	 * They get to go first.
+	 */
+	if (sem->sleepers > 1)
+		goto inside;
+
 	for (;;) {
 		int sleepers = sem->sleepers;
@@ -76,6 +84,7 @@
 			break;
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
+inside:
 		spin_unlock_irq(&semaphore_lock);
 		schedule();
^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] semaphore fairness patch against test11-pre6
  2000-11-18  1:01 [PATCH] semaphore fairness patch against test11-pre6 David Mansfield
@ 2000-11-18  9:45 ` Christoph Rohland
  2000-11-19  1:12   ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Rohland @ 2000-11-18 9:45 UTC (permalink / raw)
To: David Mansfield; +Cc: Linus Torvalds, lkml

Hi David,

David Mansfield <lkml@dm.ultramaster.com> writes:
> If you can find the time to check this out more completely, I recommend
> it, because it seems like a great improvement to be able to accurately
> see vmstat numbers in times of system load. I hope the other side
> effects are beneficial as well :-)

I wanted to point out that there may be some performance impacts by
this: We had exactly the new behaviour on SYSV semaphores. It led to
very bad behaviour in high load situations since for high frequency,
short critical paths this led to very high context switch rates
instead of using the available time slice for the program.

We changed the behaviour of SYSV semaphores to the current kernel sem
behaviour and never had problems with that change.

I still think that your change is right since this is kernel space and
you do not have the notion of a time slice.

Christoph

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: [PATCH] semaphore fairness patch against test11-pre6
  2000-11-18  9:45 ` Christoph Rohland
@ 2000-11-19  1:12   ` Andrew Morton
  2000-11-19  1:47     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2000-11-19 1:12 UTC (permalink / raw)
To: Christoph Rohland; +Cc: David Mansfield, Linus Torvalds, lkml

Christoph Rohland wrote:
>
> Hi David,
>
> David Mansfield <lkml@dm.ultramaster.com> writes:
> > If you can find the time to check this out more completely, I recommend
> > it, because it seems like a great improvement to be able to accurately
> > see vmstat numbers in times of system load. I hope the other side
> > effects are beneficial as well :-)
>
> I wanted to point out that there may be some performance impacts by
> this: We had exactly the new behaviour on SYSV semaphores. It led to
> very bad behaviour in high load situations since for high frequency,
> short critical paths this led to very high context switch rates
> instead of using the available time slice for the program.
>
> We changed the behaviour of SYSV semaphores to the current kernel sem
> behaviour and never had problems with that change.
>
> I still think that your change is right since this is kernel space and
> you do not have the notion of a time slice.

Has anyone tried it on SMP?  I get fairly repeatable instances of
immortal `D'-state processes with this patch.

The patch isn't right - it allows `sleepers' to increase without bound.
But it's a boolean!
If you cut out the unnecessary code and the incorrect comments from
__down() it looks like this:

void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);
	tsk->state = TASK_UNINTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	for (;;) {
		/*
		 * Effectively:
		 *
		 *	if (sem->sleepers)
		 *		sem->count++;
		 *	if (sem->count >= 0) {
		 *		sem->sleepers = 0;
		 *		break;
		 *	}
		 */
		if (!atomic_add_negative(sem->sleepers, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;
		spin_unlock_irq(&semaphore_lock);
		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}

I spent a couple of hours trying to get "fairness" working right and
have not been able to come up with a non-racy solution.

That semaphore algorithm is amazing.  I'm really impressed.
* Re: [PATCH] semaphore fairness patch against test11-pre6
  2000-11-19  1:12 ` Andrew Morton
@ 2000-11-19  1:47   ` Linus Torvalds
  2000-11-19 12:51     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2000-11-19 1:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Rohland, David Mansfield, lkml

On Sun, 19 Nov 2000, Andrew Morton wrote:
>
> Has anyone tried it on SMP?  I get fairly repeatable instances of
> immortal `D'-state processes with this patch.

Too bad. I really thought it should be safe to do.

> The patch isn't right - it allows `sleepers' to increase without bound.
> But it's a boolean!

It's not a boolean. It's really a "bias count". It happens to get only
the values 0 and 1 simply because the logic is that we always account for
all the other people when any process goes to sleep, so "sleepers" only
ever counts the one process that went to sleep last.

But the algorithm itself should allow for other values. In fact, I think
that you'll find that it works fine if you switch to non-exclusive
wait-queues, and the only reason you see the repeatable D states is
exactly the case where we didn't "take" the semaphore even though we were
awake, and that basically makes us an exclusive process that didn't react
to an exclusive wakeup.

(Think of it this way: with the "inside" patch, the process does

	tsk->state = TASK_INTERRUPTIBLE;

twice, even though there was only one semaphore that woke it up: we
basically "lost" a wakeup event, not because "sleepers" cannot be 2, but
because we didn't pick up the wakeup that we might have gotten.)
Instead of the "goto inside", how about just doing it without the "double
sleep", and doing something like

	tsk->state = TASK_INTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	sem->sleepers ++;
+	if (sem->sleepers > 1) {
+		spin_unlock_irq(&semaphore_lock);
+		schedule();
+		spin_lock_irq(&semaphore_lock);
+	}
	for (;;) {

The only difference between the above and the "goto inside" variant is
really that the above sets "tsk->state = TASK_INTERRUPTIBLE;" just once
per loop, not twice as the "inside" case did. So if we happened to get an
exclusive wakeup at just the right point, we won't go to sleep again and
miss it.

But these things are very subtle. The current semaphore algorithm was
basically perfected over a week of some serious thinking. The fairness
change should similarly get a _lot_ of attention. It's way too easy to
miss things.

Does the above work for you even in SMP?

		Linus
* Re: [PATCH] semaphore fairness patch against test11-pre6
  2000-11-19  1:47 ` Linus Torvalds
@ 2000-11-19 12:51   ` Andrew Morton
  2000-11-19 18:46     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2000-11-19 12:51 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christoph Rohland, David Mansfield, lkml

Linus Torvalds wrote:
>
> ...
> But the algorithm itself should allow for other values. In fact, I think
> that you'll find that it works fine if you switch to non-exclusive
> wait-queues, and the only reason you see the repeatable D states is
> exactly the case where we didn't "take" the semaphore even though we were
> awake, and that basically makes us an exclusive process that didn't react
> to an exclusive wakeup.
>
> (Think of it this way: with the "inside" patch, the process does
>
>	tsk->state = TASK_INTERRUPTIBLE;
>
> twice, even though there was only one semaphore that woke it up: we
> basically "lost" a wakeup event, not because "sleepers" cannot be 2, but
> because we didn't pick up the wakeup that we might have gotten.

I don't see a path where David's patch can cause a lost wakeup in the
way you describe.

> Instead of the "goto inside", how about just doing it without the "double
> sleep", and doing something like
>
>	tsk->state = TASK_INTERRUPTIBLE;
>	add_wait_queue_exclusive(&sem->wait, &wait);
>
>	spin_lock_irq(&semaphore_lock);
>	sem->sleepers ++;
> +	if (sem->sleepers > 1) {
> +		spin_unlock_irq(&semaphore_lock);
> +		schedule();
> +		spin_lock_irq(&semaphore_lock);
> +	}
>	for (;;) {
>
> The only difference between the above and the "goto inside" variant is
> really that the above sets "tsk->state = TASK_INTERRUPTIBLE;" just once
> per loop, not twice as the "inside" case did. So if we happened to get an
> exclusive wakeup at just the right point, we won't go to sleep again and
> miss it.

This still causes stuck processes.  There's a bonnie++ test which
hammers pipes.
It's quite easy to get four instances of bonnie++ stuck on a pipe
semaphore.  This is with your suggested change applied to both __down
and __down_interruptible, BTW.

Switching both functions to use non-exclusive waitqueues didn't help.
Still missing a wakeup somewhere.

Moving on to this version:

void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);
	tsk->state = TASK_UNINTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	if (sem->sleepers)
		goto inside;
	for (;;) {
		/*
		 * Add "everybody else" into it. They aren't
		 * playing, because we own the spinlock.
		 */
		if (!atomic_add_negative(sem->sleepers, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;	/* us - see -1 above */
inside:
		spin_unlock_irq(&semaphore_lock);
		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}

The only difference here is that we're not allowing `sem->sleepers' to
take the value '2' outside the spinlock.  But it still turns bonnie++
into Rip van Winkle.

Next step is to move the waitqueue and wakeup operations so they're
inside the spinlock.  Nope.  That doesn't work either.

Next step is to throw away the semaphore_lock and use the sem->wait
lock instead.  That _does_ work.  This is probably just a fluke - it
synchronises the waker with the sleepers and we get lucky.

But at least it's now benchmarkable.  It works well.  Kernel build
times are unaffected.  bw_tcp is down a few percent.  Worth pursuing.

David, could you please verify that the patch at

	http://www.uow.edu.au/~andrewm/linux/semaphore.patch

still fixes the starvation problem?

> But these things are very subtle. The current semaphore algorithm was
> basically perfected over a week of some serious thinking. The fairness
> change should similarly get a _lot_ of attention. It's way too easy to
> miss things.
Agree.  The fact that this algorithm can work while random CPUs are
asynchronously incrementing and decrementing sem->count makes analysis
and understanding a _lot_ harder.

There's a lot more work needed here.
* Re: [PATCH] semaphore fairness patch against test11-pre6
  2000-11-19 12:51 ` Andrew Morton
@ 2000-11-19 18:46   ` Linus Torvalds
  2000-11-20 13:39     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2000-11-19 18:46 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Rohland, David Mansfield, lkml

On Sun, 19 Nov 2000, Andrew Morton wrote:
>
> I don't see a path where David's patch can cause a lost wakeup in the
> way you describe.

Basically, if there are two up() calls, they might end up waking up only
one process, because the same process goes to sleep twice. That's wrong.
It should wake up two processes.

However, thinking about it more, that's obviously possible only for
semaphores that are used for more than just mutual exclusion, and
basically nobody does that anyway.

> Next step is to move the waitqueue and wakeup operations so they're
> inside the spinlock.  Nope.  That doesn't work either.
>
> Next step is to throw away the semaphore_lock and use the sem->wait
> lock instead.  That _does_ work.  This is probably just a fluke - it
> synchronises the waker with the sleepers and we get lucky.

Yes, especially on a two-cpu machine that kind of synchronization can
basically end up hiding real bugs.

I'll think about this some more. One thing I noticed is that the
"wake_up(&sem->wait);" at the end of __down() is kind of bogus: we don't
actually want to wake anybody up at that point at all, it's just that if
we don't wake anybody up we'll end up having "sem = 0, sleeper = 0", and
when we unlock the semaphore the "__up()" logic won't trigger, and we
won't ever wake anybody up. That's just incredibly bogus.

Instead of the "wake_up()" at the end of __down(), we should have
something like this at the end of __down() instead:

	... for-loop ...
	}
	tsk->state = TASK_RUNNING;
	remove_wait_queue(&sem->wait, &wait);

	/* If there are others, mark the semaphore active */
	if (wait_queue_active(&sem->wait)) {
		atomic_dec(&sem->count);
		sem->sleepers = 1;
	}
	spin_unlock_irq(&semaphore_lock);
}

which would avoid an unnecessary reschedule, and cause the wakeup to
happen at the proper point, namely "__up()" when we release the
semaphore.

I suspect this may be part of the trouble with the "sleepers" count
playing: we had these magic rules that I know I thought about when the
code was written, but that aren't really part of the "real" rules.

		Linus
* Re: [PATCH] semaphore fairness patch against test11-pre6
  2000-11-19 18:46 ` Linus Torvalds
@ 2000-11-20 13:39   ` Andrew Morton
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2000-11-20 13:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christoph Rohland, David Mansfield, lkml

Linus Torvalds wrote:
>
> ...
>
> I'll think about this some more. One thing I noticed is that the
> "wake_up(&sem->wait);" at the end of __down() is kind of bogus: we don't
> actually want to wake anybody up at that point at all, it's just that if
> we don't wake anybody up we'll end up having "sem = 0, sleeper = 0", and
> when we unlock the semaphore the "__up()" logic won't trigger, and we
> won't ever wake anybody up. That's just incredibly bogus.

There's another reason why we need the wakeup in the tail of __down():

 1 void __down(struct semaphore * sem)
 2 {
 3 	struct task_struct *tsk = current;
 4 	DECLARE_WAITQUEUE(wait, tsk);
 5 	tsk->state = TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE;
 6 	add_wait_queue_exclusive(&sem->wait, &wait);
 7
 8 	spin_lock_irq(&semaphore_lock);
 9 	sem->sleepers++;
10 	for (;;) {
11 		int sleepers = sem->sleepers;
12
13 		/*
14 		 * Add "everybody else" into it. They aren't
15 		 * playing, because we own the spinlock.
16 		 */
17 		if (!atomic_add_negative(sleepers - 1, &sem->count)) {
18 			sem->sleepers = 0;
19 			break;
20 		}
21 		sem->sleepers = 1;	/* us - see -1 above */
22 		spin_unlock_irq(&semaphore_lock);
23
24 		schedule();
25 		tsk->state = TASK_UNINTERRUPTIBLE;
26 		spin_lock_irq(&semaphore_lock);
27 	}
28 	spin_unlock_irq(&semaphore_lock);
29 	remove_wait_queue(&sem->wait, &wait);
30 	tsk->state = TASK_RUNNING;
31 	wake_up(&sem->wait);
32 }

Suppose two tasks A and B are sleeping on the semaphore and somebody
does an up().  Task A wakes and sets TASK_UNINTERRUPTIBLE.  Task A
then executes statements 26, 11, 17, 18, 19, 28 and 29 while it's on
the head of the exclusive waitqueue in state TASK_UNINTERRUPTIBLE.
If during this, another CPU does an up() on the semaphore, the wakeup
will be directed to Task A and will be lost.  Hence Task A must do the
wakeup once it has removed itself from the waitqueue, just in case this
happened.

Did anyone ever mention that exclusive wakeups should atomically remove
tasks from the waitqueue? :)

> Instead of the "wake_up()" at the end of __down(), we should have
> something like this at the end of __down() instead:
>
>	... for-loop ...
>	}
>	tsk->state = TASK_RUNNING;
>	remove_wait_queue(&sem->wait, &wait);
>
>	/* If there are others, mark the semaphore active */
>	if (wait_queue_active(&sem->wait)) {
>		atomic_dec(&sem->count);
>		sem->sleepers = 1;
>	}
>	spin_unlock_irq(&semaphore_lock);
> }
>
> which would avoid an unnecessary reschedule, and cause the wakeup to
> happen at the proper point, namely "__up()" when we release the
> semaphore.

This won't help, because in the failure scenario we never get to exit
the 'for' loop.  Read on...

Here's why the `goto inside' thingy isn't working:

Assume the semaphore is free:

Task A does down():
-------------------
count->0
sleepers=0

Task B does down():
-------------------
count: 0 -> -1
sleepers: 0->1
if (sleepers > 1) is false
atomic_add_negative(1-1, -1): count -1 -> -1. `if' fails.
sleepers: 1->1
schedule()

Task C does down():
-------------------
count: -1 -> -2
sleepers: 1->2
if (sleepers > 1) is true
schedule()

Someone does up():
------------------
(reminder: count=-2, sleepers=2)
count: -2 -> -1
wakeup()
-FINISHED-

Task C, CPU0                            CPU1
wakes up
tsk->state = TASK_UNINTERRUPTIBLE
spin_lock                               down()
                                        count: -1->-2
                                        spins....
(reminder: sleepers=2, count=-2)
atomic_add_negative(2-1, -2): count->-1
aargh! `if' fails!
sleepers->1
spin_unlock()                           unspins....
schedule()                              sem->sleepers++
                                        sleepers: 1->2
                                        if (sleepers > 1) is true
                                        aargh! atomic_add_negative
                                        test would have succeeded!
                                        spin_unlock()
                                        schedule()

** Nobody woke up **

Someone does up(), but assume CPU1 gets the lock
------------------------------------------------
(reminder: count=-2, sleepers=2)
count: -2 -> -1
wakeup()
-FINISHED-

Task C, CPU0                            CPU1
wakes up
tsk->state = TASK_UNINTERRUPTIBLE       down()
                                        count: -1->-2
spin_lock
spins...                                sem->sleepers++
                                        sleepers: 2->3
                                        if (sleepers > 1) is true
                                        spin_unlock()
                                        schedule()
unspins...
(reminder: sleepers=3, count=-2)
atomic_add_negative(3-1, -2): count->0
'if' succeeds
sleepers->1
spin_unlock()
schedule()

** OK, that worked, and it was "fair" **

Here's the story with the current (test11) implementation
=========================================================

Task A does down():
-------------------
count->0
sleepers=0

Task B does down():
-------------------
count: 0 -> -1
sleepers: 0->1
atomic_add_negative: count -1 -> -1. false.
sleepers->1
schedule()

Task C does down():
-------------------
count: -1 -> -2
sleepers: 1->2
atomic_add_negative: count -2 -> -1. false
sleepers->1
schedule()

Someone does up():
------------------
count: -1 -> 0
wakeup()
-FINISHED-

Task C, CPU0                            CPU1
wakes up
tsk->state = TASK_UNINTERRUPTIBLE
spin_lock                               down()
                                        count: 0 -> -1
                                        spins....
(reminder: count=-1, sleepers=1)
atomic_add_negative(1-1, -1): count->-1
`if' fails.
sleepers->1
spin_unlock()                           unspins...
schedule()                              sem->sleepers++: 1->2
                                        atomic_add_negative(2-1, -1):
                                        count: -1->0
                                        'if' is true.
                                        sleepers->0
                                        break
                                        <runs>

** OK, that worked too **

> I suspect this may be part of the trouble with the "sleepers" count
> playing: we had these magic rules that I know I thought about when the
> code was written, but that aren't really part of the "real" rules.

Linus, this code is too complex.