Subject: rwsem-scale.patch
From: David Howells @ 2004-04-15 15:47 UTC
  To: Andrew Morton; +Cc: linux-kernel


Hi Andrew,

Please don't apply the rwsem-scale patch as is. It has a race in it.

The problem is this: once the waiting flag in a struct rwsem_waiter has been
cleared by __rwsem_do_wake() [waiter->flags = 0;], the waiter record is liable
to be deallocated: the down_*() function that allocated it on its stack may
return, and the stack space holding the record is then reclaimed.
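
For reference, the waiter record is a small on-stack structure; the following
is only a sketch along the lines of lib/rwsem.c of this era (field names
approximate, error handling omitted), showing why clearing the flag releases
the record:

	struct rwsem_waiter {
		struct list_head	list;	/* linked into sem->wait_list */
		struct task_struct	*task;
		unsigned int		flags;	/* non-zero while still waiting */
	};

	static void down_read_failed_sketch(struct rw_semaphore *sem)
	{
		struct rwsem_waiter waiter;	/* on the stack: dies on return */

		waiter.task = current;
		waiter.flags = RWSEM_WAITING_FOR_READ;
		/* ... link waiter.list into sem->wait_list under
		 *     sem->wait_lock, then drop the lock ... */

		set_current_state(TASK_UNINTERRUPTIBLE);
		while (waiter.flags) {		/* cleared by __rwsem_do_wake() */
			schedule();
			set_current_state(TASK_UNINTERRUPTIBLE);
		}
		__set_current_state(TASK_RUNNING);
		/* returning now reclaims the stack slot holding 'waiter' */
	}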

So envision this (the racy shape is sketched in code after the list):

 (1) A bunch of processes running on an SMP machine try to down_read() an
     rwsem, but are forced to wait.

 (2) One of these processes is also on another waitqueue for some reason or
     other.

 (3) The current holder of the rwsem performs an up_write(), which observes
     that the rwsem has waiters and invokes __rwsem_do_wake().

 (4) The algorithm grabs the rwsem's spinlock, moves the batch of
     processes that added themselves in (1) to its wake-up list and marks them
     all as "woken".

 (5) The algorithm adjusts the semaphore "counter" and drops the spinlock.

 (6) An interrupt happens on the CPU doing this, causing it to go and do stuff
     for a while (it could even go and awaken some of the processes on the
     wake-up list built in (4)).

 (7) Something on a different CPU triggers the alternate waitqueue mentioned
     in (2). This wakes up one of the processes linked into the wake-up list.

 (8) That process then continues running: it leaves rwsem_down_read_failed(),
     and discards the list_head that is _still_ linked into the wake-up list
     built in (4). This is permitted because the waiting flag has been
     cleared.

 (9) The first CPU resumes processing, and then oopses because one of the
     records linked into the wake-up list has now been reused for some other
     process.
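
Put concretely, the racy shape is roughly the following (a sketch, not the
literal patch code; the private wake-up list name is my invention):

	static void __rwsem_do_wake_sketch(struct rw_semaphore *sem)
	{
		struct rwsem_waiter *waiter, *next;
		LIST_HEAD(wake_list);

		spin_lock(&sem->wait_lock);
		/* step (4): batch the waiting readers onto a private list
		 * and mark each one woken */
		list_for_each_entry_safe(waiter, next, &sem->wait_list, list) {
			list_move_tail(&waiter->list, &wake_list);
			waiter->flags = 0;	/* record may now vanish */
		}
		/* step (5): adjust sem->count, drop the lock */
		spin_unlock(&sem->wait_lock);

		/* steps (6)-(9): if we dally here and a waiter gets woken
		 * through some other waitqueue, it returns from down_read()
		 * and the list links we are about to follow are reused
		 * stack memory */
		list_for_each_entry_safe(waiter, next, &wake_list, list)
			wake_up_process(waiter->task);	/* oops */
	}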

You might be able to achieve much of the desired effect by only walking the
list once. If you look at __rwsem_do_wake(), you can see it walks the list
twice after the readers_only label. This is not really necessary: the waking
and the counting could be merged into a single pass, with the rwsem counter
adjusted once after that loop.
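
Something like this, perhaps (a sketch with the same approximate names as
above; the counter arithmetic follows lib/rwsem.c's bias scheme, and note it
still has the task-lifetime problem described below):

	/* Inside __rwsem_do_wake(), with sem->wait_lock held: */
	struct rwsem_waiter *waiter, *next;
	struct task_struct *tsk;
	int woken = 0;

	list_for_each_entry_safe(waiter, next, &sem->wait_list, list) {
		if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
			break;			/* readers only */
		list_del(&waiter->list);
		tsk = waiter->task;
		waiter->flags = 0;
		wake_up_process(tsk);
		woken++;
	}
	/* one counter adjustment after the single loop: each woken reader
	 * trades its waiting bias for an active bias */
	rwsem_atomic_add(woken * (RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS), sem);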

Actually, I'm not sure my algorithm is entirely safe either... it should
probably keep the refcount on the task struct incremented from before the task
is queued in rwsem_down_failed_common() until after the wake_up_process() call
in __rwsem_do_wake().

If you do that, and move both the flag clearing and the wake_up_process() call
outside of the locked section, that will probably work.
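
By way of illustration, the pinning could look something like this (an
untested sketch reusing the private wake_list from the earlier one;
get_task_struct()/put_task_struct() hold the task over the window, and the
barrier keeps the last touch of the waiter record before the flag clear):

	/* In rwsem_down_failed_common(), when queueing the waiter: */
	waiter->task = tsk;
	get_task_struct(tsk);		/* pin until __rwsem_do_wake() is done */
	list_add_tail(&waiter->list, &sem->wait_list);

	/* In __rwsem_do_wake(), after dropping sem->wait_lock: */
	list_for_each_entry_safe(waiter, next, &wake_list, list) {
		tsk = waiter->task;
		list_del(&waiter->list);
		smp_mb();		/* finish touching the record... */
		waiter->flags = 0;	/* ...before letting it vanish */
		wake_up_process(tsk);	/* safe: tsk is pinned */
		put_task_struct(tsk);	/* drop the queue-time reference */
	}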

David
