From: Andrew Morton <andrewm@uow.edu.au>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: Christoph Rohland <cr@sap.com>,
David Mansfield <lkml@dm.ultramaster.com>,
lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] semaphore fairness patch against test11-pre6
Date: Sun, 19 Nov 2000 23:51:00 +1100
Message-ID: <3A17CCB4.19416026@uow.edu.au>
In-Reply-To: <3A172918.BD09CA9@uow.edu.au> <Pine.LNX.4.10.10011181732080.1413-100000@penguin.transmeta.com>

Linus Torvalds wrote:
>
> ...
> But the algorithm itself should allow for other values. In fact, I think
> that you'll find that it works fine if you switch to non-exclusive
> wait-queues, and the only reason you see the repeatable D states is
> exactly the case where we didn't "take" the semaphore even though we were
> awake, and that basically makes us an exclusive process that didn't react
> to an exclusive wakeup.
>
> (Think of it this way: with the "inside" patch, the process does
>
> tsk->state = TASK_INTERRUPTIBLE;
>
> twice, even though there was only one semaphore that woke it up: we
> basically "lost" a wakeup event, not because "sleepers" cannot be 2, but
> because we didn't pick up the wakeup that we might have gotten.

I don't see a path where David's patch can cause a lost wakeup in the
way you describe.
> Instead of the "goto inside", how about just doing it without the "double
> sleep", and doing something like
>
>	tsk->state = TASK_INTERRUPTIBLE;
>	add_wait_queue_exclusive(&sem->wait, &wait);
>
>	spin_lock_irq(&semaphore_lock);
>	sem->sleepers ++;
> +	if (sem->sleepers > 1) {
> +		spin_unlock_irq(&semaphore_lock);
> +		schedule();
> +		spin_lock_irq(&semaphore_lock);
> +	}
>	for (;;) {
>
> The only difference between the above and the "goto inside" variant is
> really that the above sets "tsk->state = TASK_INTERRUPTIBLE;" just once
> per loop, not twice as the "inside" case did. So if we happened to get an
> exclusive wakeup at just the right point, we won't go to sleep again and
> miss it.

This still causes stuck processes. There's a bonnie++ test which hammers
pipes. It's quite easy to get four instances of bonnie++ stuck on a pipe
semaphore. This is with your suggested change applied to both __down and
__down_interruptible, BTW. Switching both functions to use non-exclusive
waitqueues didn't help. Still missing a wakeup somewhere.

Moving on to this version:

void __down(struct semaphore * sem)
{
	struct task_struct *tsk = current;
	DECLARE_WAITQUEUE(wait, tsk);

	tsk->state = TASK_UNINTERRUPTIBLE;
	add_wait_queue_exclusive(&sem->wait, &wait);

	spin_lock_irq(&semaphore_lock);
	if (sem->sleepers)
		goto inside;
	for (;;) {
		/*
		 * Add "everybody else" into it. They aren't
		 * playing, because we own the spinlock.
		 */
		if (!atomic_add_negative(sem->sleepers, &sem->count)) {
			sem->sleepers = 0;
			break;
		}
		sem->sleepers = 1;	/* us - see -1 above */
inside:
		spin_unlock_irq(&semaphore_lock);
		schedule();
		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irq(&semaphore_lock);
	}
	spin_unlock_irq(&semaphore_lock);
	remove_wait_queue(&sem->wait, &wait);
	tsk->state = TASK_RUNNING;
	wake_up(&sem->wait);
}

The only difference here is that we're not allowing `sem->sleepers'
to take the value '2' outside the spinlock. But it still turns
bonnie++ into Rip van Winkle.

Next step is to move the waitqueue and wakeup operations so they're
inside the spinlock. Nope. That doesn't work either.

Next step is to throw away the semaphore_lock and use the sem->wait
lock instead. That _does_ work. This is probably just a
fluke - it synchronises the waker with the sleepers and we get lucky.
But at least it's now benchmarkable. It works well. Kernel build
times are unaffected. bw_tcp is down a few percent. Worth pursuing.

David, could you please verify that the patch at
http://www.uow.edu.au/~andrewm/linux/semaphore.patch
still fixes the starvation problem?

> But these things are very subtle. The current semaphore algorithm was
> basically perfected over a week of some serious thinking. The fairness
> change should similarly get a _lot_ of attention. It's way too easy to
> miss things.

Agree. The fact that this algorithm can work while random CPUs are
asynchronously incrementing and decrementing sem->count makes
analysis and understanding a _lot_ harder.

There's a lot more work needed here.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
Thread overview: 7+ messages
2000-11-18 1:01 [PATCH] semaphore fairness patch against test11-pre6 David Mansfield
2000-11-18 9:45 ` Christoph Rohland
2000-11-19 1:12 ` Andrew Morton
2000-11-19 1:47 ` Linus Torvalds
2000-11-19 12:51 ` Andrew Morton [this message]
2000-11-19 18:46 ` Linus Torvalds
2000-11-20 13:39 ` Andrew Morton